A  SYSTEM  FOR  TIME  MODIFICATION  OF  SYNTHESIZED  SPEECH 


By 
JOHN McLEAN WHITE III


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 

1995 


To  Marie  Grande  White, 
the  most  courageous  person  I  have  ever  known. 


ACKNOWLEDGMENTS 

I  am  forever  grateful  to  my  advisor  and  committee  chairman,  Dr.  Donald  G. 
Childers.  For  five  long  years,  he  has  been  an  untiring  source  of  guidance,  direction,  and 
knowledge.  His  dauntless  dedication  to  students  like  myself  is  highly  commendable. 

I  am  grateful  to  Dr.  Howard  Rothman  for  his  insightful  observations  regarding  the 
science  of  speech.  He  was  also  instrumental  in  the  development  and  success  of  the  listening 
tests.  His  willingness  to  repeatedly  participate  as  the  lone  speech  scientist  in  a  group  of 
electrical engineers is admirable, and is greatly appreciated.

In  addition,  I  would  like  to  thank  Dr.  Jose  Principe,  Dr.  Scott  Miller,  and  Dr.  Fred 
Taylor  for  serving  on  my  committee.  I  would  also  like  to  thank  all  of  the  past  and  present 
members  of  the  Mind-Machine  Interaction  Research  Center  for  their  help  and 
understanding. 

This  research  was  funded  by  the  Mind-Machine  Interaction  Research  Center  and 
the  Audio  Engineering  Society  Educational  Foundation. 




TABLE  OF  CONTENTS 

ACKNOWLEDGMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 History of Time Modification Methods
1.1.1 Variable-Playback-Rate Method
1.1.2 Sampling Method
1.1.3 Vocoder Methods
1.1.4 Recent Methods
1.2 Phonological versus Psychological Testing
1.3 Review of Research
1.3.1 Quantitative Measures of Time Modification
1.3.2 Phonological Tests
1.3.3 Psychological Tests
1.3.3.1 Intelligibility tests and influencing factors
1.3.3.2 Comprehensibility tests and influencing factors
1.4 Motivation
1.5 Goals
1.6 General System Description
1.7 Chapter Organization

2 SPEECH ANALYSIS, SEGMENTATION, AND LABELING
2.1 Selection of Speech Segment Categories
2.2 Overview of Automatic Segment Detection
2.3 Feature Detection Algorithms: General Development
2.3.1 Input Data and V/U/S Pre-Processing
2.3.2 Volume Function
2.3.3 Fixed Thresholds and Feature Scores
2.3.4 Automatic Correction Rules
2.3.5 Summary
2.4 Feature Detection Algorithms: Detailed Descriptions
2.4.1 Sonorant Detection
2.4.2 Vowel Detection
2.4.3 Voiced Consonant Detection
2.4.4 Voice Bar Detection
2.4.5 Formant Tracking
2.4.6 Nasal Detection
2.4.7 Semivowel Detection
2.4.8 Voiced Fricative Detection
2.4.9 Unvoiced Stop and Fricative Detection
2.5 Speech Segmentation
2.5.1 Spectral-Based Boundary Detection and Segmentation
2.5.2 V/U/S Boundary Detection
2.5.3 Final Segmentation
2.6 Segment Labeling
2.7 Manual Modification of Automatic Segmentation and Labeling Results
2.7.1 Description of Errors
2.7.2 Software and GUI for Manual Modification

3 TIME MODIFICATION ALGORITHMS AND USER INTERFACE
3.1 The Linear Prediction Coding (LPC) Speech Synthesizer
3.2 Time Modification Basics: Frame Skipping and Frame Doubling
3.3 User-Specified Modification Parameters
3.4 Mapping
3.5 Time Modification and Synthesis
3.6 Glitch Prevention
3.7 Graphical User Interface (GUI)
3.7.1 Main Window
3.7.2 Preview Window
3.7.3 Scale Factors Window
3.7.4 Minimum Durations Window
3.7.5 Manual Scale Factors Window
3.7.6 Map Window
3.7.6.1 Map Display window
3.7.6.2 Map Edit window
3.7.7 Postview Window
3.8 Summary

4 LISTENING TESTS
4.1 Word-Length versus Sentence-Length Test Tokens
4.2 Pilot Studies
4.2.1 Quality
4.2.2 Nasals
4.2.3 Stops
4.2.4 Fricatives
4.3 Development of the Formal Listening Test
4.3.1 Test Tokens
4.3.1.1 Type
4.3.1.2 Duration
4.3.1.3 Position
4.3.1.4 Synthesis and time resolution
4.3.2 Test Format
4.3.3 Listeners and Listening Environment
4.3.3.1 Type
4.3.3.2 Number
4.3.3.3 Training
4.3.3.4 Screening
4.4 Results of the Formal Listening Test
4.4.1 Perception of the Time-Modified Variations of the Word "Sue"
4.4.2 Perception of the Time-Modified Variations of the Word "Zoo"
4.4.3 Perception of the Time-Modified Variations of the Word "Said"
4.4.4 Perception of the Time-Modified Variations of the Word "Zed"
4.4.5 Summary of Answers Selected Most Often
4.5 Summary

5 DISCUSSION OF THE FORMAL LISTENING TEST RESULTS
5.1 Perception of the Time-Modified /s/
5.1.1 The Word "Sue"
5.1.1.1 Tokens that preserved the beginning of the /s/
5.1.1.2 Tokens that preserved the middle of the /s/
5.1.1.3 Tokens that preserved the end of the /s/
5.1.1.4 Comparison of the results as a function of position
5.1.2 The Word "Said"
5.1.2.1 Tokens that preserved the beginning of the /s/
5.1.2.2 Tokens that preserved the middle of the /s/
5.1.2.3 Tokens that preserved the end of the /s/
5.1.2.4 Comparison of the results as a function of position
5.1.3 Summary for "Sue" and "Said"
5.2 Perception of the Time-Modified /z/
5.2.1 The Word "Zoo"
5.2.1.1 Tokens that preserved the beginning of the /z/
5.2.1.2 Tokens that preserved the middle of the /z/
5.2.1.3 Tokens that preserved the end of the /z/
5.2.1.4 Comparison of the results as a function of position
5.2.2 The Word "Zed"
5.2.2.1 Tokens that preserved the beginning of the /z/
5.2.2.2 Tokens that preserved the middle of the /z/
5.2.2.3 Tokens that preserved the end of the /z/
5.2.2.4 Comparison of the results as a function of position
5.2.3 Summary for "Zoo" and "Zed"
5.3 General Observations

6 SUMMARY AND CONCLUSIONS
6.1 Summary
6.1.1 The Time Modification System
6.1.2 The Listening Tests
6.2 Recommendations for Further Work
6.2.1 Additional Listening Tests
6.2.2 Enhancements to the Time Modification System

APPENDICES

A THE DIAGNOSTIC RHYME TEST WORD LIST
B LISTENING TEST INSTRUCTIONS
C FORMAL LISTENING TEST RESULTS

REFERENCES

BIOGRAPHICAL SKETCH

Abstract  of  Dissertation  Presented  to  the  Graduate  School 

of  the  University  of  Florida  in  Partial  Fulfillment  of  the 

Requirements  for  the  Degree  of  Doctor  of  Philosophy 

A SYSTEM FOR TIME MODIFICATION OF SYNTHESIZED SPEECH

By 

John  McLean  White  III 
May,  1995 

Chairman:  Dr.  Donald  G.  Childers 
Major  Department:  Electrical  Engineering 

The  aim  of  this  research  was  twofold.  The  first  goal  was  to  create  a  software-based, 
time  modification  system  to  independently  and  automatically  modify  the  durations  of  the 
phonetic  segments  in  a  speech  signal.  The  system  was  intended  to  be  used  to  create  high 
quality  test  tokens  for  use  in  speech  perception  studies.  The  second  goal  was  to  use  this 
system  to  investigate  the  role  of  duration  in  perception  of  the  fricatives  /s/  and  /z/  in 
word-initial  position  in  four  single-syllable  words. 

The  first  portion  of  the  time  modification  system  analyzes  the  speech  signal.  The 
signal is divided into phoneme-type segments, and each segment is labeled as vowel,
semivowel, nasal, voice bar, voiced fricative, unvoiced fricative, unvoiced stop, or silent.
These  segmentation  and  labeling  algorithms  are  based  primarily  on  the  short-term 
frequency  distribution  of  the  speech  signal. 

The  second  portion  of  the  time  modification  system  invokes  a  graphical  user 
interface  that  allows  the  user  to  specify,  via  slide-bar  controls,  both  the  desired  time  scale 
factor  and  minimum  duration  for  each  segment.  The  user  can  also  specify  a  weighting 
function, or "map," for each segment. The map determines the portion of the segment that
is modified. The resulting time-modified speech is created by a linear predictive coding
speech  synthesizer. 

The  system  was  used  to  modify  the  duration  of  the  initial  consonant  in  the  words 
"sue,"  "zoo,"  "said,"  and  "zed."  The  duration  was  adjusted  in  10-ms  increments,  and 
ranged  from  zero  ms  to  the  original,  unmodified  duration  (approximately  240  ms).  For 
each  duration,  three  tokens  were  created:  The  first  preserved  the  beginning  of  the 
consonant,  the  second  preserved  the  middle  of  the  consonant,  and  the  third  preserved  the 
end  of  the  consonant.  A  total  of  270  tokens  were  created. 

Formal  listening  tests  showed  that  duration  strongly  affected  perception  of  the 
initial  consonant.  In  addition,  the  portion  (i.e.  beginning,  middle,  or  end)  of  the  consonant 
that  was  preserved  was  also  shown  to  affect  perception  of  many  of  the  test  tokens. 
CHAPTER 1
INTRODUCTION 


For  certain  speech  applications,  it  is  desirable  to  change  the  rate  at  which  recorded 
or  synthesized  speech  is  presented  to  a  listener.  One  example  of  this  is  a  device  that  varies 
the  playback  rate  of  audio  books  for  the  blind.  This  allows  the  non-sighted  listener  to 
"read"  at  his  or  her  own  speed,  independent  of  the  rate  at  which  the  recording  was  originally 
made. 

One  of  the  driving  forces  behind  research  of  time-modified  speech  is  the  fact  that 
it  has  long  been  known  that  a  human  being  can  comprehend  speech  at  a  rate  greater  than 
he  or  she  can  produce  speech  (de  Haan,  1982;  Foulke  and  Sticht,  1969;  Goldman-Eisler, 
1968;  Goldstein,  1940).  Therefore,  a  significant  time  savings  results  by  increasing  the  rate 
of  pre-recorded  speech  in  applications  such  as  playback  of  academic  lectures,  conference 
papers,  religious  sermons,  and  archived  political  speeches,  just  to  name  a  few.  Because  of 
this  difference  between  the  maximum  speaking  and  perception  rates,  the  large  majority  of 
published  research  has  studied  speech  compression  ("speeded  up"  speech)  as  opposed  to 
speech  expansion  ("slowed  down"  speech).  While  there  are  applications  for  speech 
expansion,  these  are  far  fewer  in  number.  The  most  common  are  expansion  of  speech  for 
the  hearing  impaired  or  for  foreign  language  learning. 

Time-modified  speech  also  has  application  as  a  significant  research  tool  for  the 
development  of  test  data  for  use  in  perceptual  studies  of  both  normal  and  pathological 
patients.  The  durations  of  different  portions  of  the  speech  signal  are  modified  in  order  to 
test  theories  of  speech  perception  from  either  a  psychological  or  phonological  viewpoint. 

In  addition,  time  modification  can  be  used  to  create  larger  databases  for 
development of speech recognition systems. Many different variations (in terms of
duration) of a test word or sentence can be systematically created from a single token. The
different  variations  can  then  either  be  used  to  further  train  the  system,  or  to  perform 
controlled  tests  of  the  system's  ability  to  correctly  detect  data  different  from  the  training 
data. 

1.1 History of Time Modification Methods

The  methods  used  to  accomplish  time  modification  have  evolved  through  several 
stages  over  the  last  40  years.  This  section  provides  some  history  and  background  into  these 
methods.  Although  there  are  variations  in  the  specific  implementations,  almost  all  of  the 
methods  used  to  date  can  be  assigned  to  one  of  four  main  categories. 

1.1.1  Variable-Playback-Rate  Method 

The  variable-playback-rate  method  is  relatively  simple,  and  accomplishes  a  rate 
change by playing back previously recorded speech at a rate different from the original
recording rate. An example of this is playing an LP phonograph record at 45 RPM, instead
of its intended rate, 33 1/3 RPM. A modern, digital signal processing (DSP) analogy of
this  is  digital-to-analog  conversion  (and  appropriate  filtering)  at  a  sampling  rate  different 
from  the  signal's  original  sampling  rate,  without  interpolation  or  decimation.  Note  that  for 
this  technique,  the  rate  change  is  always  accompanied  by  a  linear  shift  in  the  frequency 
content  of  the  signal.  For  speech  that  has  been  slowed  down  or  speeded  up  by  a  factor  of 
about  two  or  more,  this  frequency  shift  leads  to  undesirable  perceptual  effects  that  mask 
the  identity  of  the  speaker  and,  in  general,  cause  a  decrease  in  intelligibility  (Garvey,  1953a; 
Tiffany  and  Bennett,  1961). 
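
The coupling between rate and frequency described above can be illustrated with a few lines of arithmetic. The following is only a sketch: the function name and the 10-second, 100-Hz example passage are assumptions introduced for illustration, and the frequency change is modeled as a simple scaling of every component by the playback factor.

```python
def playback_rate_change(duration_s, frequency_hz, rate_factor):
    """Effect of playing a recording back `rate_factor` times faster.

    The duration shrinks by the factor and every frequency component is
    multiplied by it; the two changes cannot be separated.
    """
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    return duration_s / rate_factor, frequency_hz * rate_factor


# The LP example from the text: 45 RPM instead of 33 1/3 RPM.
lp_factor = 45 / (100 / 3)

# Hypothetical 10-second passage with a 100 Hz fundamental (assumed values).
new_duration, new_pitch = playback_rate_change(10.0, 100.0, lp_factor)
print(f"factor {lp_factor:.2f}: {new_duration:.2f} s, {new_pitch:.0f} Hz")
```

Because a single factor controls both quantities, any useful amount of compression or expansion necessarily moves the pitch and formants as well.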

The  variable-playback-rate  method  was  never  popular  among  researchers,  mainly 
because  of  the  detrimental  perceptual  effects  attributed  to  the  accompanying  pitch  shift. 
However,  it  was  the  only  method  available  until  about  1950,  when  the  sampling  method 
was  introduced. 


1.1.2  Sampling  Method 

The  general  class  of  rate-change  techniques  known  as  the  "sampling  method" 
involves  the  periodic  removal  or  duplication  of  small  segments  of  recorded  speech.  The 
remaining  segments  are  then  spliced  together  to  form  the  rate-altered  speech.  The  main 
advantage  of  this  method  is  that  the  frequency  content  of  the  resulting  speech  is  not 
affected,  and  as  a  result,  many  of  the  speaker-dependent  characteristics  are  preserved.  It 
has  also  been  shown  that  for  a  variety  of  different  rates,  the  speech  produced  by  this  method 
is  significantly  more  intelligible  than  speech  produced  by  the  variable-playback-rate 
method  (Fletcher,  1929;  Garvey,  1951;  Lee,  1972). 
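
A minimal sketch of the sampling idea on a digital signal may help. It assumes fixed-length segments counted in samples; the function names and parameters are hypothetical and do not describe any particular published device.

```python
def sampling_compress(samples, keep_len, discard_len):
    """Keep `keep_len` samples, drop the next `discard_len`, and repeat.

    The retained segments are abutted (spliced) with no smoothing, so
    discontinuities at the splice points are possible.
    """
    if keep_len <= 0 or discard_len < 0:
        raise ValueError("keep_len must be > 0 and discard_len >= 0")
    period = keep_len + discard_len
    out = []
    for start in range(0, len(samples), period):
        out.extend(samples[start:start + keep_len])
    return out


def sampling_expand(samples, seg_len):
    """Play every `seg_len`-sample segment twice (2:1 expansion)."""
    if seg_len <= 0:
        raise ValueError("seg_len must be > 0")
    out = []
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        out.extend(segment)
        out.extend(segment)
    return out


signal = list(range(12))                     # stand-in for recorded speech
compressed = sampling_compress(signal, keep_len=2, discard_len=1)
expanded = sampling_expand(signal, seg_len=3)
```

Since the samples themselves are never resampled, the spectrum within each retained segment is unchanged; only the number of segments differs.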

One  way  of  implementing  the  sampling  method  is  by  manually  cutting  and  splicing 
magnetic  recording  tape  (Garvey,  1949).  The  disadvantage  of  this  method  is  the  time 
required  to  perform  the  task.  Another  disadvantage  is  that  it  is  impossible  to  guarantee 
waveform  continuity  across  the  splice  boundary.  This  results  in  audible  "pops"  and 
"clicks,"  although  these  clicks  can  be  reduced  by  cutting  the  tape  at  a  45  degree  angle 
relative  to  the  edge  of  the  tape.  This  method  does,  however,  have  the  advantage  of  allowing 
the  user  to  (manually)  select  the  locations  and  durations  of  the  discarded  segments.  To 
mark  the  tape,  the  operator  manually  passes  the  tape  back  and  forth  across  the  playback 
head  of  a  tape  recorder  and  listens  for  the  starting  and  stopping  points  of  a  syllable.  Once 
these  points  are  found,  they  are  marked  and  labeled  on  the  back  of  the  tape  with  a  grease 
pencil.  This  method  is  flexible,  but  extremely  time  consuming.  As  a  result,  it  is  often 
impractical  for  extensive  use  and  is  typically  used  only  for  "proof  of  concept." 

Today,  this  "cut  and  splice"  method  is  typically  implemented  on  a  digital  computer 
and  a  CRT  display.  Although  the  physical  inconvenience  of  the  magnetic  tape  is  no  longer 
present,  the  problems  of  manual  segment  identification  and  waveform  continuity  across 
the  splice  boundary  still  remain. 


An  automatic  time  modification  method  exists  that  is  based  on  a  modified  magnetic 
tape recorder (Fairbanks et al., 1954). It involves a rotating playback head assembly that
contains four playback heads. As the head assembly rotates, each one of the four playback
heads individually and sequentially contacts the moving magnetic tape. The outputs from
the four heads are wired in parallel. Time compression results from the fact that there are
gaps between each of the four heads. Therefore, as the head assembly rotates, every
segment of the tape that contacts one of these gaps (instead of a playback head) is not
reproduced. The spacings in the head assembly are calculated so that as the head
assembly turns, one playback head is always beginning to make contact with the tape just
as the previous playback head is losing contact with the tape. The output of the rotating
head  assembly  is  re-recorded  onto  a  second,  conventional,  tape  recorder.  The  second  tape 
then  contains  the  compressed  speech. 

The  initial  automatic  time  modification  device  introduced  by  Fairbanks  in  1954  has 
some  major  limitations.  The  biggest  problem  is  that  the  duration  of  the  segments  that  are 
discarded  or  duplicated  is  fixed.  This  was  changed  in  later  adaptations  and  copies  of  the 
original  machine  (Lee,  1972;  Neuburg,  1978).  However,  other  problems  ultimately  limit 
the  fidelity  and  usefulness  of  the  machine.  The  first  of  these  problems  is  the  lack  of 
repeatability  of  the  process.  In  order  to  repeat  the  process  of  compressing  a  tape  segment, 
the user has to know exactly the starting phase of the rotating head assembly with respect
to the beginning of the tape. The second problem is the noise created by the rotating head
assembly's slip rings and distortion due to the heads' misalignment with the moving tape.

Despite the shortcomings of the method, the large majority of published research
on the intelligibility and comprehensibility of compressed speech was done using the
sampling method.


1.1.3  Vocoder  Methods 

A  third  class  of  rate  change  techniques  is  accomplished  by  the  use  of  vocoders 
(VOice  CODERS).  Vocoders  were  originally  designed  to  reduce  the  bandwidth 
requirements  for  transmission  of  a  normal  voice  signal.  Their  ability  to  modify  the  rate  of 
speech  is  thought  of  as  a  secondary  benefit.  Of  all  of  the  vocoders,  the  phase  vocoder  is 
the  best  suited  for  rate  modification  (Flanagan  and  Golden,  1966). 

Vocoders  implement  an  analysis-synthesis  speech  transmission  scheme.  In  the 
analysis  stage,  natural  speech  is  analyzed,  typically  by  a  bank  of  bandpass  filters.  The 
output  of  each  bandpass  filter  in  the  bank  is  coded  by  one  of  a  variety  of  different  methods, 
and  this  coded  information  is  transmitted  across  a  channel.  At  the  synthesis  stage  at  the 
receiving  end  of  the  channel,  the  coded  information  is  decoded,  and  is  used  to  control  a 
bank  of  tuned  oscillators.  The  outputs  of  the  oscillators  are  then  summed  to  produce 
synthesized  speech  (Rabiner  and  Schafer,  1978). 

Typically,  the  synthesis  oscillators  are  tuned  to  the  same  frequencies  as  the 
bandpass  filters  in  the  analysis  stage.  However,  this  one-to-one  match  in  tuning  is  not 
strictly  required,  and  if  the  oscillator  frequencies  are  tuned  to  multiples  of  the  analysis 
stage's  bandpass  filters,  it  is  possible  to  implement  a  modification  of  the  synthesized 
speech.  For  example,  the  phase  vocoder  can  be  used  to  implement  a  rate  change  in  the 
following  two-stage  manner:  In  the  first  stage,  speech  is  analyzed  by  a  bank  of  equally 
spaced bandpass filters with center frequencies at $\omega_i$, for $i \in \{1, 2, 3, \ldots, N\}$. The outputs
of the bank of bandpass filters are then used to control a bank of oscillators tuned to center
frequencies $\omega_i / 2$, for $i \in \{1, 2, 3, \ldots, N\}$. At this point, the rate of the synthetic speech
is identical to that of the original speech, but the frequency spectrum of the synthetic speech
is shifted down to one-half that of the original speech. The second stage of the process is
to double the playback speed of the speech synthesized by the first stage. The resulting
speech is twice the rate of the original speech, and has the same spectrum as the original
speech. 
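
The bookkeeping of the two-stage procedure can be checked numerically. The sketch below tracks only center frequencies and duration; it does not implement a phase vocoder. The function name, the three example filter frequencies, and the one-second duration are assumptions chosen for illustration.

```python
def two_stage_rate_change(center_freqs_hz, duration_s, n=2):
    """Track frequencies and duration through the two-stage scheme.

    Stage 1: resynthesize with oscillators tuned to f_i / n
             (spectrum scaled down, duration unchanged).
    Stage 2: play the result back n times faster
             (spectrum scaled back up, duration divided by n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    stage1_freqs = [f / n for f in center_freqs_hz]
    stage1_duration = duration_s

    stage2_freqs = [f * n for f in stage1_freqs]
    stage2_duration = stage1_duration / n
    return stage1_freqs, stage2_freqs, stage2_duration


analysis_freqs = [250.0, 500.0, 750.0]   # assumed filter center frequencies
stage1, stage2, final_duration = two_stage_rate_change(analysis_freqs, 1.0, n=2)
```

After the second stage the frequencies return to their analysis values while the duration is halved, which is the 2:1 compression described above.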

While  vocoders  are  able  to  modify  the  rate  of  speech,  they  suffer  from  the  fact  that 
their analysis-synthesis schemes create unwanted artifacts in the speech signal (Portnoff,
1981).  The  speech  produced  by  vocoders  is  often  described  as  sounding  artificial  or 
"buzzy."  Another  problem  previously  associated  with  the  use  of  vocoders  for  research 
applications  is  that  in  the  1960s  and  1970s,  vocoders  were  relatively  expensive  and  not  a 
cost-effective  option  for  many  researchers. 

The  literature  shows  that  vocoders  were  seldom  used  in  experiments  on  the 
intelligibility of rate-altered speech. This is probably due to the problems
listed,  as  well  as  the  fact  that  vocoders  were  unavailable  when  the  peak  in  the  research 
interest  in  rate-altered  speech  occurred  (late  1950s  and  throughout  the  1960s).  Note, 
however,  that  vocoders  have  been  studied  extensively  for  speech  that  has  not  been  rate 
altered.  Many  of  the  low-bit-rate  communication  schemes  in  use  today  employ  basic 
bandpass  filter  concepts  first  introduced  in  the  early  vocoders  (Jayant,  1990). 

1.1.4  Recent  Methods 

There has been continued interest over the last 10 to 15 years in newer methods
of  modifying  the  rate  of  speech.  While  the  speech  produced  by  these  methods  is  seldom 
tested  in  formal  intelligibility  or  comprehensibility  tests,  the  methods  are  being  studied  due 
to  their  low  cost,  low  computational  requirement,  and  relative  ease  of  implementation. 
Note  that  some  of  the  newer  methods  are  hybrids  of  older  vocoder  technology  and  recent 
waveform  coding  technology. 

The  simplest  new  method  consists  of  "a  pitch  detector  followed  by  an  algorithm 
that  discards  (or  repeats)  pieces  of  speech  equal  in  length  to  a  pitch  period"  (Neuburg, 
1978).  This  is  a  minor  variation  of  the  sampling  method.  The  method  does  not  operate 
pitch-synchronously, meaning that the beginning of the segment that is either duplicated
or discarded does not occur at the instant of glottal closure. The method relies upon the
fact  that  for  the  majority  of  time,  the  speech  signal  does  not  vary  greatly  across  a  single 
pitch  period  (about  10  ms  for  a  male  speaker).  Therefore,  as  long  as  the  duration  of  the 
discarded (or repeated) segment is exactly equal to the pitch period, the ear cannot discern
any  significant  distortion  from  the  process.  No  formal  listening  tests  have  been  conducted 
for  this  method. 
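
The reason the segment length must equal the pitch period can be seen with a toy periodic signal. This is a sketch under strong assumptions: the pitch period is taken as known and constant (the method described above uses a pitch detector instead), and the function name and parameters are hypothetical.

```python
def discard_pitch_periods(samples, block_len, drop_every):
    """Drop one `block_len`-sample block out of every `drop_every` blocks.

    Blocks start at arbitrary positions, not at glottal closure, so the
    operation is not pitch-synchronous.
    """
    if block_len <= 0 or drop_every <= 0:
        raise ValueError("block_len and drop_every must be positive")
    out = []
    for index, start in enumerate(range(0, len(samples), block_len)):
        if (index + 1) % drop_every == 0:
            continue  # this block is discarded
        out.extend(samples[start:start + block_len])
    return out


# Toy "voiced" signal with an assumed pitch period of 4 samples.
period = 4
voiced = [i % period for i in range(24)]

# Block length equal to the pitch period: the waveform stays periodic.
matched = discard_pitch_periods(voiced, block_len=4, drop_every=3)

# Block length different from the pitch period: the splice breaks the pattern.
mismatched = discard_pitch_periods(voiced, block_len=3, drop_every=3)
```

With a matched block length the output is simply a shorter copy of the same periodic waveform, whereas the mismatched case introduces a discontinuity at each splice.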

Another  speech  rate  modification  method  by  Malah  is  similar  in  principle  to  the 
two-step  process  implemented  by  the  phase  vocoder  described  in  Section  1.1.3  (Malah, 
1979).  For  both  of  these  methods,  if  the  speech  rate  is  modified  by  an  integer  scale  factor 
n,  the  first  step  shifts  the  frequency  spectrum  by  a  factor  of  n,  and  the  second  step  plays 
the  frequency-shifted  speech  at  a  speed  of  n  times  the  original  rate.  Since  the  second  step 
of  this  process  is  essentially  trivial,  the  success  of  this  method  relies  on  the  ability  to  shift 
the  frequency  spectrum  of  the  speech.  Malah  developed  numerically-efficient  algorithms 
that  can  shift  the  frequency  spectrum  of  speech  (without  changing  the  rate).  These 
algorithms  are  known  as  time-domain  harmonic  scaling  (TDHS)  algorithms.  In  most 
cases,  the  algorithms  require  only  one  multiplication  and  two  additions  per  output  sample 
of speech. The primary problem in this implementation is that rate modification is only
implemented in integer multiples (e.g., 2:1, 3:1). Thus, only a small, finite set of
compression and expansion ratios can be implemented. Malah claims that for his two-step
rate modification scheme, rates of greater than 2:1 are impractical, "due to perceptual
limitations." Because of this limitation to integer multiples, the algorithm is of limited
use for speech research. Although no formal tests were conducted, the author states
that  "Simulation  results  with  a  scaling  factor  of  two  for  different  speakers  and  texts  have 
been  informally  judged  to  be  very  good  ...." 

A  recent  and  popular  approach  for  time  modification  is  based  upon  the  short-term 
Fourier transform (STFT). The method is composed of three parts. The first part models
the speech signal with the STFT. The second part modifies the STFT parameters to
implement the rate change. The third part synthesizes the modified speech signal from the
modified STFT parameters (Portnoff, 1981). The method was simulated on a DEC PDP-11
computer. While no formal listening tests were performed, the author claims that the
system "is capable of producing high quality rate-changed speech ... for compression ratios
as high as 3:1 and expansion ratios as high as 4:1." For ratios outside this range, the method
introduces  reverberation  for  expanded  speech,  and  exhibits  a  "rough"  quality  for 
compressed  speech. 

Another  recent  approach  models  the  speech  signal  by  a  set  of  time-varying  sine 
waves  (Quatieri  and  McAulay,  1992).  In  terms  of  the  general  procedure,  this  method  is 
very  similar  to  the  STFT  approach.  The  speech  is  modeled  by  a  set  of  time-varying 
parameters, namely sine wave amplitudes and phases. The algorithm adjusts the speech
parameters, and then re-synthesizes the modified speech by controlling a set of sine wave
generators.  Again,  no  formal  listening  tests  were  conducted,  and  the  authors  state  that  for 
time-modified  speech,  "the  synthesized  speech  was  generally  natural  sounding  and  free  of 
artifacts." 

Perhaps  the  most  interesting  point  in  the  Quatieri  and  McAulay  study  is  that  they 
experiment  with  what  they  call  "speech-adaptive  time-scale  modification."  In  essence, 
they  implement  rate  change  by  modifying  only  the  voiced  portions  of  speech.  In  addition, 
they  measure  the  "degree"  of  voicing,  and  concentrate  their  time-base  modification  on  the 
frames  that  exhibit  the  highest  degree  of  voicing.  Note  that  their  measurement  of  the  degree 
of  voicing  is  based  upon  how  little  the  harmonic  structure  varies  across  multiple 
frames: the less the harmonics vary, the higher the "degree of voicing." However, no
formal  listening  tests  were  conducted  to  test  their  results. 

1.2  Phonological  versus  Psychological  Testing 

Time  modification  is  a  research  methodology  for  studying  certain  aspects  of 
speech. Time modification of speech is used in a multitude of tests in several interrelated
research areas. In this study, the research on time modification techniques is grouped
according  to  two  different  points  of  view:  "psychological  testing"  and  "phonological 
testing."  These  terms  categorize  the  research  according  to  its  ultimate  purpose. 

It  is  important  to  emphasize  that  the  definitions  given  in  this  study  for  psychological 
and  phonological  testing  are  provided  solely  to  divide  the  large  number  of  applications  of 
time-modified  speech  into  more  manageable  categories.  Strictly  speaking,  there  is  an 
overlap  of  the  two  definitions.  For  example,  it  is  incorrect  to  assume  from  the  following 
definition  of  phonological  testing  that  speech  pathologists  are  unconcerned  with  the  speech 
perception  process.  Consequently,  not  all  of  the  studies  labeled  here  as  psychological  tests 
were conducted by psychologists; several were conducted by speech pathologists. In
general,  both  psychologists  and  speech  pathologists  examine  the  perception  process, 
although  each  group  tends  to  focus  on  different  parts  of  the  process,  or  at  least  approach 
it  from  a  different  point  of  view. 

Despite  the  above  problems,  the  studies  described  as  psychological  tests  in  this 
report  are  usually  conducted  by  psychologists.  One  goal  is  the  investigation  of  accelerated 
learning  rates.  The  ultimate  goal,  however,  is  to  create  an  accurate  model  of  the  speech 
perception  process.  Researchers  often  attempt  to  link  the  listening  test  results  with  higher 
cognitive  processes,  and  not  surprisingly,  the  results  are  often  in  disagreement. 

In  general,  psychological  testing  is  concerned  with  issues  including  (1) 
determination of the highest rate at which speech can be presented to a listener and still be
understood,  (2)  an  explanation  of  why  our  perception  process  fails  at  higher  speaking  rates, 
and  (3)  the  role  of  short-term  and  long-term  memory  in  speech  perception. 

Psychological  testing  utilizes  rate  change  over  relatively  large  time  intervals, 
typically  sentences  or  paragraphs.  The  results  are  measured  in  terms  of  either  intelligibility 
or comprehensibility. The term "intelligibility" is defined as a measure of the ability to
repeat a short word, phrase, or sentence (Carlson et al., 1979). For example, suppose a
listener is presented with a list of time-compressed words. After each word, he or she is
asked to write down or speak the word that he or she heard. The percentage of correct
responses  is  the  measure  of  intelligibility.  The  term  "comprehensibility"  is  defined  as  the 
listener's  ability  to  answer  a  detailed  set  of  questions  about  a  passage  of  rate-altered  text 
(Foulke, 1968; Heiman et al., 1986). The test differs from intelligibility in that the
listener's  understanding  and  comprehension  of  the  material  is  tested,  not  just  the 
intelligibility.  Of  course,  these  factors  are  related.  For  example,  if  a  single  word  in  a 
passage  of  text  is  unintelligible,  it  may  (or  may  not,  depending  upon  the  information 
content  of  the  word)  affect  the  comprehensibility  of  the  passage. 

Psychological  testing  was  conducted  extensively  in  the  1950s  and  1960s.  Although 
a  universally  accepted  model  of  perception  was  never  created,  a  wealth  of  data  was 
collected  regarding  the  intelligibility  and  comprehensibility  of  time-modified  words, 
sentences,  and  paragraphs. 

In  contrast,  phonological  testing  is  often  conducted  by  speech  pathologists.  The 
goal  is  to  model  human  perception  of  phonemes  (or  similar  segments  of  speech).  The 
results  are  discussed  in  terms  of  measurable  acoustic  features  and  how  their  presence  (or 
absence)  affects  perception. 

Although  phonological  tests  usually  measure  intelligibility,  they  differ  from 
psychological  tests  in  that  they  relate  the  intelligibility  to  acoustic  features  measured  from 
the  speech  signal.  For  example,  the  tests  are  concerned  with  questions  like  "how  much  of 
the /s/ must be removed from the word "said" in order to cause the word "zed" to be
perceived," and "what acoustic features are used to distinguish between the phonemes /b/
and /w/?"

An  example  of  a  phonological  test  incorporates  an  algorithm  that  removes  the 
initial  portion  of  the  initial  consonant  /s/  in  the  word  "sit."  Multiple  tokens  of  the  word  are 
created  by  progressively  removing  longer  durations  of  the  initial  consonant.  The  tokens 
are  then  used  in  listening  tests  to  determine  how  duration  influences  perception  of  the 
unvoiced fricative /s/ in a word-initial position in a consonant-vowel-consonant (CVC)
word. 
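
The token-creation step in such a test can be sketched in a few lines. The following is only an illustration in Python with NumPy, not the software used in any of the studies cited here; the function name, the 10 kHz sampling rate, the 150 ms consonant duration, and the 50 ms step are all assumptions made for the example, and random noise stands in for a recorded word.

```python
import numpy as np

def make_truncated_tokens(word, consonant_ms, step_ms, fs=10_000):
    """Return tokens with progressively more of the word-initial consonant removed.

    word         : 1-D array of samples; the consonant is assumed to start at sample 0
    consonant_ms : assumed duration of the initial consonant, in milliseconds
    step_ms      : extra duration removed for each successive token, in milliseconds
    fs           : sampling frequency in Hz (assumed value)
    """
    tokens = []
    removed_ms = 0
    while removed_ms <= consonant_ms:
        start = int(round(removed_ms * fs / 1000))
        tokens.append(word[start:])  # drop the first `start` samples
        removed_ms += step_ms
    return tokens

# Hypothetical 400 ms "word"; its first 150 ms stand in for the /s/ frication.
fs = 10_000
word = np.random.default_rng(0).standard_normal(4_000)
tokens = make_truncated_tokens(word, consonant_ms=150, step_ms=50, fs=fs)
token_lengths = [len(t) for t in tokens]
```

Each token in the list would then be presented to listeners, and the responses tabulated against the duration of consonant that remains.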

1.3  Review  of  Research 

The  review  of  the  research  is  divided  into  two  parts  according  to  the  type  of  test  that 
was  performed:  psychological  or  phonological.  Although  this  study  is  primarily  concerned 
with  phonological  testing,  review  of  psychological  tests  is  beneficial  since  there  is  some 
overlap  between  the  two  definitions,  and  since  many  factors  that  influence  listening  tests 
are  discussed  in  the  psychological  studies.  This  section  begins  with  a  discussion  of  the 
various  definitions  of  the  degree  of  time  modification  used  in  various  studies. 

1.3.1  Quantitative  Measures  of  Time  Modification 

Researchers  are  not  in  agreement  regarding  a  single  quantitative  measurement  (or 
definition)  of  the  amount  of  time  compression  or  expansion,  other  than  specifying  the 
initial  and  final  word  rates  in  words  per  minute  (w.p.m.). 

One  definition  in  use  defines  the  compression  ratio  (or  expansion  ratio)  as  the  ratio 
of  the  duration  of  the  resulting  time-altered  speech  to  the  duration  of  the  original  speech. 
For example, if ten seconds of speech are compressed to a duration of six seconds, the
compression  ratio  is  60%,  or  0.6.  This  is  the  definition  used  in  this  study. 

A second definition in use defines the compression ratio as the ratio of the duration of the
discarded interval (or added interval, for speech expansion) to the duration of the original
speech. Note that this is the complement of the above definition. Thus, for the same durations
as in the above example, the speech compression ratio is 40%, or 0.4.

Alternatively,  the  compression  or  expansion  ratio  is  described  in  terms  of  the  ratio 
of  the  final  to  initial  word  rate,  both  defined  in  w.p.m.  This,  of  course,  requires  that  more 
than  one  word  is  being  modified. 


Due  to  the  lack  of  agreement  in  the  definition  of  the  modification  ratio,  each 
researcher  typically  states  the  definition  that  he  or  she  uses.  Although  adequate,  this  creates 
problems  when  trying  to  compare  the  results  of  different  reports  that  use  different 
definitions.  Therefore,  for  the  purpose  of  comparison,  in  the  following  sections  all  of  the 
original speech compression ratios cited in the literature have been numerically converted
to  conform  with  the  definition  of  "compression  ratio"  used  in  this  study. 
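
The conversions between the three definitions amount to simple arithmetic, sketched here in Python. The function names are hypothetical, and the word-rate conversion assumes that the same words are spoken before and after modification, so that duration is inversely proportional to word rate.

```python
def ratio_from_durations(original_s, modified_s):
    """Definition used in this study: modified duration / original duration."""
    return modified_s / original_s

def ratio_from_discard_fraction(discard_fraction):
    """Convert 'discarded duration / original duration' to the study's definition."""
    return 1.0 - discard_fraction

def ratio_from_word_rates(initial_wpm, final_wpm):
    """Convert a word-rate change to a duration ratio.

    Assumes the word count is unchanged, so duration scales as 1 / word rate.
    """
    return initial_wpm / final_wpm

# Ten seconds compressed to six seconds (the example in the text).
study_ratio = ratio_from_durations(10.0, 6.0)
# The same modification stated with the second definition (40% discarded).
converted_ratio = ratio_from_discard_fraction(0.4)
# A passage sped up from 141 w.p.m. to 282 w.p.m.
rate_ratio = ratio_from_word_rates(141.0, 282.0)
```

With these conventions, the first two calls describe the same modification and give the same ratio, and doubling the word rate corresponds to a ratio of one half.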

1.3.2  Phonological  Tests 

This  subsection  discusses  the  various  phonological  tests  that  examine  the  effects  of 
removing a portion of a word or phoneme. The studies relate the results to measurable
acoustic  features,  such  as  duration  and  frequency  distribution. 

Cole  and  Cooper  (1975)  studied  the  effects  of  consonant  and  vowel  duration  on  the 
perception  of  the  voiced-voiceless  distinction  for  two  affricates  and  four  fricatives  in 
word-initial  position.  Shortening  the  duration  of  the  word-initial  frication  in  an  unvoiced 
affricate  or  unvoiced  fricative  changed  the  perceived  result  from  unvoiced  to  voiced. 
However,  changing  the  duration  of  the  following  vowel  had  little  effect  upon  the  perceived 
result. 

Denes  (1955)  studied  the  effects  of  consonant  and  vowel  duration  when  the 
consonant  occurred  in  a  word-final  position.  He  created  multiple  variations  of  two 
single-syllable  words  using  all  combinations  of  four  different  vowel  durations  and  five 
different  consonant  durations.  He  concluded  that  both  the  consonant  duration  and  the 
preceding  vowel  duration  had  a  significant  effect  upon  the  perceived  voicing  of  the  final 
consonant.  Short  consonant  durations  resulted  in  a  voiced  percept,  while  longer  consonant 
durations resulted in an unvoiced percept. Interestingly, short preceding-vowel durations
resulted in an unvoiced percept of the consonant, while longer preceding-vowel durations
resulted in a voiced percept of the consonant. It was concluded that perception is not
performed on a "phoneme-by-phoneme" basis, and that the attributes of a single phoneme
affect  the  perception  of  adjacent  phonemes. 

Raphael (1972) also studied the effect of preceding-vowel duration upon the
perceived voicing of a word-final consonant. Raphael reported that "regardless of the cues
for voicing or voicelessness used in the synthesis of the final consonant or cluster, listeners
perceived the final segments as voiceless when they were preceded by vowels of short
duration and as voiced when they were preceded by vowels of long duration." This agrees
in part with the results of Denes in that the preceding-vowel duration plays an important
role  in  the  perception  of  the  following  consonant.  However,  this  differs  from  Denes  in  that 
it  negates  the  importance  of  consonant  duration  in  perception  of  the  voicing  of  the  final 
consonant. 

Grimm (1966) studied the effect of eliminating different amounts of the initial
portion of the consonant in consonant-vowel syllables. It was reported that "the listeners
were able to detect correct place of articulation more accurately than either voicing or
manner of release as greater amounts of the initial part of a syllable were removed." Both
initial fricatives and initial plosives were examined. The results showed that as the initial
consonant duration was decreased, errors in identification of the initial fricatives occurred
more gradually than errors in the identification of the initial plosives.

Jongman  (1989)  studied  the  effect  of  varying  the  duration  of  frication  in 
consonant-vowel (CV) syllables. His results showed that the required duration of frication
for correct fricative identification varied depending upon the specific fricative. In addition,
the  results  disagreed  with  those  of  Cole  and  Cooper  (1975)  in  that  "subjects  do  not  have 
a  tendency  to  identify  more  fricatives  as  voiced  as  the  frication  duration  decreases."  No 
explanation  was  given  for  this  disagreement. 

Summerfield,  Bailey,  Seton,  and  Dorman  (1981)  investigated  the  minimum 
duration  of  silence  between  "s"  and  "lit"  required  to  hear  the  word  "split."  The  results 
showed  that  typically,  "split"  was  heard  when  the  silent  interval  exceeded  about  50  ms. 
However, this value was not observed to be constant for all test conditions. They reported
that  "less  silence  is  required  (a)  if  the  intensity  fall  at  the  "s"  is  made  more  abrupt,  or  (b) 
if  the  durations  of  the  "s"  and  "lit"  segments  are  reduced." 

The  finding  that  less  silence  is  required  if  the  surrounding  segments  are  shortened 
is  important.  It  is  an  example  of  the  widely  believed  theory  that  a  listener's  perceptual 
"thresholds"  used  in  speech  perception  are  not  absolute,  and  vary  depending  upon  multiple 
factors  including  speaking  rate  and  speaking  style  (Gottfried  et  al.,  1990;  Lindblom,  1963; 
Miller, 1981; Miller and Baer, 1983; Miller and Liberman, 1979; Summerfield et al., 1981).

1.3.3  Psychological  Tests 

Due  to  the  large  number  of  studies,  as  well  as  the  variety  of  test  conditions  used  in 
psychological  tests,  it  is  difficult  to  divide  the  research  into  logical  units  for  comparison. 
One  approach  is  to  classify  the  research  based  upon  the  physical  device  used  to  modify  the 
speech.  However,  almost  all  of  the  quantitative  results  were  obtained  in  studies  that  used 
the  sampling  method  exclusively.  Therefore,  a  logical  division  and  discussion  of  the 
research  based  upon  method  of  implementation  is  inappropriate. 

There  are  some  common  threads,  however.  Almost  all  of  the  studies  measured 
either the intelligibility or the comprehensibility of time-modified speech, but usually not
both (see Section 1.2 for definitions). There is also a general trend in the literature to isolate
and  focus  upon  the  different  factors  that  affect  rate-altered  speech  (Foulke  and  Sticht, 
1969),  which  is  logical,  since  the  ultimate  goal  of  psychological  studies  is  to  model  the 
abstract  levels  of  the  speech  perception  process.  As  an  example  of  the  research  focus  on 
individual  features,  one  study  measured  the  comprehension  of  rate-altered  speech  as  a 
function  of  the  age  of  the  test  subject,  while  another  study  measured  the  intelligibility  of 
rate-altered  speech  as  a  function  of  the  intelligence  of  the  test  subject. 

For  this  report,  the  review  of  the  psychological  tests  is  divided  into  two  subsections: 
the  studies  of  intelligibility,  and  the  studies  of  comprehensibility.  The  test  results,  in  terms 
of the factors that influence the tests, are discussed in each subsection. This approach offers
the  advantage  of  listing  all  of  the  factors  that  must  be  considered  (and  possibly  eliminated) 
in  designing  a  new  study. 

1.3.3.1  Intelligibility  tests  and  influencing  factors 

One  of  the  primary  test  factors  that  influences  intelligibility  is  the  method  used  to 
compress  or  expand  speech.  There  are  unique  problems  associated  with  each  of  the  time 
modification  methods  outlined  in  Section  1.1  of  this  report.  For  example,  speech 
compression  implemented  by  increasing  the  playback  speed  suffers  from  an  accompanying 
shift in the frequency content. Thus, speech compressed to 50% of its original duration by
the  variable  playback  rate  produces  different  results  than  speech  compressed  to  50%  of  its 
original  duration  by  the  sampling  method  (Garvey,  1953a). 

Garvey  (1953a)  studied  intelligibility  at  many  different  rates  using  both  the 
variable  playback  rate  method  and  the  sampling  method.  He  found  that  the  sampling 
method produced significantly better intelligibility scores than the variable playback
method for compressed speech. Among his findings was the result that speech compressed
to 0.66 times its original duration by the sampling method had a mean intelligibility of 98.7%, while speech
compressed  the  same  amount  by  the  variable  playback  rate  method  had  a  mean 
intelligibility  of  only  58%.  He  also  observed  that  speech  compressed  to  0.4  times  its 
original  duration  by  the  sampling  method  still  had  an  intelligibility  score  greater  than  90%. 

Fletcher  studied  the  effects  of  altering  the  speech  using  the  variable-playback-rate 
method  (Fletcher,  1929).  He  found  that  the  intelligibility  of  speech  decreased  sharply  for 
expanded  or  compressed  speech  as  the  rate  was  changed.  For  expanded  speech, 
intelligibility decreased to approximately 50% at an expansion rate of 1.5. For compressed
speech,  intelligibility  decreased  to  approximately  50%  at  a  compression  ratio  of  about 
0.625.  He  concluded  that  the  primary  reason  for  loss  of  intelligibility  was  the  frequency 
shift,  and  not  the  actual  speed  of  the  speech. 
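
The size of the frequency shift follows directly from the compression ratio: playing a recording back in 0.625 times its original duration multiplies every frequency by 1/0.625 = 1.6. The sketch below illustrates this in Python with NumPy. It emulates a playback-speed change by linear interpolation, which is an assumption made for the example and not a model of the equipment used in the studies cited; the 500 Hz test tone and 10 kHz sampling rate are likewise illustrative.

```python
import numpy as np

def variable_playback(signal, ratio):
    """Emulate a playback-speed change that scales the duration by `ratio`.

    The original samples are read at a step of 1 / ratio and the result is
    assumed to be played at the original sampling rate, so all frequencies
    are multiplied by 1 / ratio.
    """
    n_out = int(round(len(signal) * ratio))
    positions = np.arange(n_out) / ratio  # read positions in the original
    return np.interp(positions, np.arange(len(signal)), signal)

fs = 10_000                      # assumed sampling frequency in Hz
t = np.arange(fs) / fs           # one second of time samples
tone = np.sin(2 * np.pi * 500 * t)

fast = variable_playback(tone, 0.625)
spectrum = np.abs(np.fft.rfft(fast))
peak_hz = np.argmax(spectrum) * fs / len(fast)  # dominant frequency after speed-up
```

The dominant frequency of the sped-up tone lies near 800 Hz rather than 500 Hz, which is the kind of spectral shift to which the loss of intelligibility was attributed.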

Another factor that influences intelligibility is the duration of the segment to be
discarded  or  repeated  (these  durations  are  known  in  the  literature  as  the  "discard  interval" 
and the "repeat interval", respectively). For example, a 50% compression ratio can be
accomplished by either deleting 10 ms from every 20 ms segment of speech, or by deleting
80 ms from every 160 ms segment of speech. While these two methods implement the same
overall  quantitative  amount  of  compression,  they  have  different  perceptual  results.  This 
is  because  in  the  second  case,  the  80  ms  segment  that  is  deleted  may  delete  an  entire 
phoneme,  whereas  the  case  of  deleting  every  10  ms  has  a  much  smaller  chance  of  deleting 
an  entire  phoneme  (Fairbanks  and  Kodman,  1957).  This  factor  was  studied  by  several 
researchers.  Note  again  that  all  studies  used  the  sampling  method.  Garvey  compressed 
single  words  by  a  factor  of  0.5  using  a  variety  of  different  discard  interval  lengths  (Garvey, 
1953a).  His  results  showed  intelligibility  scores  of  95.33%,  95.67%,  95.0%  and  85.67% 
for discard intervals of 40 ms, 60 ms, 80 ms, and 100 ms, respectively. Note that the
intelligibility decreased significantly for the longest discard interval. Fairbanks and
Kodman (1957) also investigated intelligibility as a function of
the  duration  of  the  discard  interval.  They  found  that  intelligibility  decreased  dramatically 
when  the  discard  interval  changed  from  80  ms  to  160  ms  (no  discard  intervals  between 
these  two  values  were  tested). 
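
The role of the discard interval can be made concrete with a minimal sketch of the sampling method in Python with NumPy. The function name and parameters are hypothetical, the retained intervals are simply abutted with no smoothing at the joins, and random noise stands in for speech, so this illustrates the bookkeeping only and not the procedure of any particular study.

```python
import numpy as np

def sampling_compress(signal, ratio, discard_ms, fs=10_000):
    """Compress `signal` by periodically discarding fixed-length intervals.

    Each period holds a retained interval followed by a discard interval of
    `discard_ms`; the period length is chosen so that retained / period
    equals `ratio` (modified duration / original duration).
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    discard = int(round(discard_ms * fs / 1000))
    period = int(round(discard / (1.0 - ratio)))
    keep = period - discard
    if keep <= 0:
        raise ValueError("discard interval leaves nothing to retain")
    # Keep the first `keep` samples of every period and drop the rest.
    pieces = [signal[i:i + keep] for i in range(0, len(signal), period)]
    return np.concatenate(pieces)

# Hypothetical 1.6 s signal at an assumed 10 kHz sampling rate.
signal = np.random.default_rng(1).standard_normal(16_000)
short_discard = sampling_compress(signal, ratio=0.5, discard_ms=10)
long_discard = sampling_compress(signal, ratio=0.5, discard_ms=80)
```

Both outputs have half the original duration, but the second removes contiguous 80 ms stretches, which is long enough to eliminate an entire short phoneme.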

Two  additional  factors  that  also  affect  intelligibility  are  the  number  of  phonemes 
in a single word (for a single-word test), and the number of syllables in a simple utterance
(for a simple-utterance test). It has been reported that the intelligibility increased as the
number of phonemes increased for tests involving single words (Foulke and Sticht, 1969).
It  has  also  been  reported  that  the  intelligibility  increased  as  the  number  of  syllables 
increased  (Klumpp  and  Webster,  1961).  Intuitively,  this  seems  reasonable,  since  as  the 
number  of  phonemes  and  syllables  increases,  the  brain  can  invoke  higher  level  perceptual 
rules  ("levels"  of  perception  are  defined  here  in  an  abstract  sense)  in  an  attempt  to  string 
the  phonemes  together  to  make  the  incomplete  word  or  utterance  "make  sense." 

Up to this point, all of the factors described are related to the characteristics of the
speech  tokens  themselves.  There  are  also  "human-based"  characteristics  that  affect 
intelligibility.  Note,  however,  that  most  of  the  studies  of  human  factors  have  been 
concerned with how these factors affected comprehension, and not intelligibility. Still, a
few  tests  specifically  relate  human  factors  to  intelligibility. 

It  has  been  reported  that  adaptation,  or  learning,  occurred  when  a  listener  was 
subjected  to  compressed  speech.  In  one  study,  repeated  exposure  to  compressed  words 
brought  about  a  small  increase  in  intelligibility  (Garvey,  1953b). 

The  hearing  capacity  (ability)  of  test  subjects  was  also  shown  to  affect  intelligibility 
of time-altered speech. Calearo and Lazzaroni (1957) showed that aged subjects with
hearing  impairment  showed  a  sharper  decrease  in  intelligibility  scores  as  the  rate  of  speech 
was  increased,  when  compared  with  subjects  with  normal  hearing  capacity. 

1.3.3.2  Comprehensibility  tests  and  influencing  factors 

There  have  been  numerous  research  studies  to  examine  how  comprehension  is 
affected  by  different  experimental  factors.  It  was  stated  earlier  that  the  factors  that  affect 
intelligibility  may  also  affect  comprehension.  Therefore,  some  of  the  following  factors 
were  discussed  in  the  previous  subsection.  The  difference  between  the  earlier  subsection 
and  this  subsection  is  that  the  specific  studies  described  here  focus  on  how  the  various 
factors  directly  affected  comprehension,  and  not  intelligibility. 

Quite  surprisingly,  there  is  disagreement  in  findings  that  speech  compressed  by  the 
sampling  method  is  more  comprehensible  than  speech  compressed  by  the 
variable-rate-playback  method.  Foulke  found  no  significant  difference  in  the  two  methods 
when  presenting  compressed  speech  to  blind  test  subjects  (Foulke,  1964).  This  is 
contrasted  by  the  findings  of  McLain,  who  found  that  her  test  subjects,  also  blind,  scored 
better  when  listening  to  speech  produced  by  the  sampling  method  than  speech  produced 
by  variable  rate  playback  (McLain,  1962). 

This disagreement is puzzling, since there is agreement concerning findings that the
method  used  strongly  affects  intelligibility.  Foulke  and  Sticht  suggested  that  the  difference 
between  intelligibility  and  comprehensibility  results  was  due  to  different  cognitive 
processes  being  invoked  for  the  two  different  tests  (Foulke  and  Sticht,  1969). 

Nonetheless,  speech  produced  by  the  sampling  method  simply  sounds  more  natural 
to  the  test  subject  than  does  speech  produced  by  the  variable  rate  playback  method.  So 
while  it  is  unclear  if  the  method  used  affects  comprehensibility,  the  sampling  method  is 
almost  always  chosen  due  to  its  more  pleasant  "sound." 

Another  fundamental  factor  that  affects  comprehension  is  word  rate.  Fairbanks, 
Guttman,  and  Miron  reported  that  there  was  little  difference  in  comprehension  at  rates  of 
141  w.p.m.,  201  w.p.m.,  and  282  w.p.m.  (Fairbanks  et  al.,  1957a).  At  470  w.p.m., 
comprehension  declined  to  26%,  compared  with  58%  at  282  w.p.m.  Another  study  found 
little  significant  change  in  the  comprehensibility  of  speech  presented  within  the  range  of 
126  w.p.m.  to  175  w.p.m.  (Diehl  et  al.,  1959).  This  is  in  general  agreement  with  other 
published  studies  that  showed  a  small  decline  in  comprehension  as  the  rate  was  increased 
to  about  300  w.p.m.,  and  then  a  much  quicker  decrease  in  comprehension  as  the  rate  was 
increased  above  300  w.p.m.  (Foulke,  1968). 

Listening  difficulty  also  affects  comprehension.  Unfortunately,  there  is  no  single 
measure  of  listening  difficulty,  so  it  is  difficult  to  compare  different  studies.  However,  in 
general  terms,  the  studies  that  have  been  performed  agreed  and  showed  that  as  the  word 
rate  was  increased,  more  difficult  selections  showed  a  quicker  decline  in  comprehension 
scores  than  did  easier  selections  (Harwood,  1955). 

As  with  intelligibility,  there  are  several  human-based  variables  that  affect 
comprehension.  A  listener's  intelligence  was  shown  to  influence  comprehension  scores 
at  different  word  rates  (Fairbanks  et  al.,  1957b).  This  study  showed  that  persons  with 
higher  intelligence  scored  better  on  comprehension  tests  as  the  speech  rate  was  increased. 

Another human-based variable that influences comprehension is the test subject's
(silent)  reading  rate.  Goldstein  found  a  positive  correlation  between  reading  rate  and 
comprehension  scores  for  speeded  speech  (Goldstein,  1940). 

Prior  learning  or  exposure  to  compressed  speech  also  influences  comprehension 
test  results.  Orr,  Friedman,  and  Williams  found  that  students  that  received  systematic 
practice  in  listening  to  compressed  speech  consistently  scored  better  than  unpracticed 
students  in  comprehension  tests  of  speech  at  rates  greater  than  twice  the  normal  rate  (Orr 
et  al.,  1965).  At  rates  less  than  twice  the  normal  rate,  the  evidence  is  not  so  conclusive. 
Friedman  conducted  an  extensive  study  concerning  learning  and  compressed  speech,  and 
found  that  comprehension  at  325  w.p.m.  was  no  different  after  one  week  of  practice 
listening than comprehension with no practice (Friedman, 1967). Note that he did find that
one week of practice improved comprehension scores at rates greater than 325 w.p.m. This
result that training and practice have different effects at different word rates again suggests
that  the  processes  that  are  involved  in  a  human's  attempt  to  process  speech  above  300 
w.p.m.  may  be  different  than  the  processes  that  are  involved  in  processing  speech  below 
300  w.p.m. 

One  controversial  human  factor  is  the  visual  ability  of  the  listener.  The  intuitive 
belief  among  the  general  population  is  that  blind  listeners  can  comprehend  speech  at  a 
greater  rate  than  sighted  persons.  However,  this  has  not  been  proven  conclusively,  and  is 
a  matter  of  debate.  Crowley,  Lake,  and  Rathgaber  found  that  in  the  intermediate  grade 
levels,  sighted  students  scored  better  than  blind  students  in  comprehension  tests  (Crowley 
et al., 1965). They tested subjects at 175 and 225 w.p.m. On the other hand, for the same
age  group  of  school  children,  Bixler,  Foulke,  Amster,  and  Nolan  (Bixler  et  al.,  1961)  found 
that  "blind  school  children  can  be  given  information  at  a  rate  commensurate  with  that 
employed  by  normal  children  with  no  loss  in  comprehension."  Their  test  measured  no 
significant  differences  between  blind  and  sighted  children  at  rates  up  to  275  w.p.m.  In 
another study, Hartlage (1963) found that there was no significant difference in
comprehension  between  blind  and  sighted  persons  at  normal  word  rates. 

While  this  result  is  surprising,  it  has  little  importance  in  practical  applications.  This 
is  because  the  average  Braille  reading  rate  for  blind  high  school  students  is  only  90  w.p.m. 
(Bixler et al., 1961). Thus, from a practical point of view, if blind students listen to speech
at a rate of 125 w.p.m., they are still effectively increasing their "reading rate" by a
substantial amount. Note that a rate of 125 w.p.m. is not considered to be "speeded speech,"
and  all  of  the  researchers  agree  that  at  this  low  rate,  there  is  little  or  no  difference  in 
comprehension  between  blind  and  sighted  students. 

1.4  Motivation 

The  majority  of  published  research  agrees  that  in  order  to  automatically  modify 
speech  duration,  one  must  periodically  remove  (or  repeat,  for  expanded  speech) 
constant-length  segments  of  a  speech  token  at  fixed  time  intervals,  with  no  regard  to  the 
phonemic  or  acoustic  content  of  the  specific  segment.  This  results  in  a  somewhat  effective 
method  of  modifying  the  speech  rate,  but  is  crude  and  ignores  the  fact  that  certain  portions 
of  the  speech  signal  carry  more  information  than  others.  As  a  result,  the  user  has  little 
control  over  the  information  that  is  eliminated. 

If  the  user  wants  to  modify  only  a  certain  portion  of  a  word  or  sentence,  he  or  she 
must  still  manually  edit  the  signal  to  produce  the  desired  test  token,  a  process  that  involves 
"cutting  and  splicing"  the  signal.  The  only  difference  in  this  procedure  between  present 
day  and  the  1950s  is  that  today,  this  cutting  and  splicing  is  done  with  the  help  of  a  waveform 
editor  that  is  implemented  on  a  digital  computer. 

This  process  of  computer-aided  cutting  and  splicing  is  exactly  what  is  used  in  many 
of  the  recent  phonological  studies  that  selectively  eliminate  portions  of  phonemes.  The 
term  "selectively"  is  defined  here  to  mean  that  the  user  can  select  the  segments  of  the  speech 
signal  to  be  discarded  (or  repeated).  While  the  cut  and  splice  method  has  the  advantage 
of being precise, it is time consuming and tedious, and it requires suitable software.
sufficient  software.  It  also  requires  that  the  user  mark  the  segments  to  be  removed  or 
repeated,  and  concatenate  the  remaining  segments.  In  addition,  it  does  not  automatically 
prevent  or  smooth  any  discontinuities  that  may  result  from  joining  two  dissimilar 
waveforms.  Therefore,  the  sets  of  test  words  that  are  created  by  this  method  are  difficult 
for  other  researchers  to  recreate,  unless  precise  documentation  is  kept  during  the 
development  of  the  test  data. 

The  disadvantages  of  both  the  waveform  editors  and  the  other  previously  described 
methods  used  to  modify  the  duration  of  speech  provided  the  motivation  for  this  study.  The 
speech  research  community  could  benefit  from  a  system  that  creates  high  quality, 
time-modified  test  tokens  in  a  quick,  convenient,  repeatable,  and  selective  manner.  Ideally, 
the  system  would  eliminate  the  need  for  computer-based  waveform  editors  and  the 
associated  manual  cutting  and  pasting  processes.  In  addition,  it  would  be  desirable  if  the 
user  could  control  the  system  with  parameters  that  are  closely  related  to  the  acoustic 
features  of  the  speech  signal.  This  would  greatly  decrease  the  training  time,  and  in  general, 
would  make  the  system  easier  for  the  speech  researcher  to  operate.  The  system  might  also 
inspire  new  research  into  time-modified  speech,  due  to  the  increased  levels  of  efficiency 
and  flexibility  that  were  previously  unavailable  in  a  time  modification  system. 

1.5  Goals 

Given  the  desire  for  a  better  time  modification  system,  this  study  set  forth  the 
following  goals:  The  first  goal  was  to  develop  and  implement  a  new  time  modification 
system  that  allows  the  user  to  selectively  modify  certain  portions  of  a  speech  signal,  based 
upon  the  signal's  time-varying  acoustical  composition.  In  order  to  aid  the  speech 
researcher,  the  segments  are  similar  to  the  set  of  phoneme  types  (i.e.  vowels,  nasals, 
semivowels,  etc.).  To  do  this,  a  software  tool  was  created  that  first  analyzes  the  speech 
signal  to  determine  the  identity  of  the  different  phonetic  segments,  and  then  independently 
modifies the durations of the phonetic segments according to global parameters specified
by  the  user.  The  time  modification  is  done  automatically  without  the  use  of  waveform 
editors.  In  addition,  the  software  is  written  in  the  MATLAB  programming  language,  and 
can  easily  be  ported  at  relatively  low  cost  to  a  wide  variety  of  computing  platforms. 

The  time  modification  system  incorporates  a  graphical  user  interface  (GUI)  that 
frees  the  user  from  having  to  remember  any  complicated  command-line  syntax.  All  of  the 
modification  parameters  are  adjusted  by  using  a  mouse  to  move  and  select  slide-bar  and 
push-button  controls  that  are  displayed  in  various  windows.  After  the  modification 
parameters  are  specified,  the  resulting  time-modified  speech  is  synthesized  and  played 
with  the  click  of  a  button. 

The  second  goal  of  this  study  was  to  test  the  time  modification  system  to  ensure  that 
it  was  capable  of  creating  high  quality,  synthesized  test  tokens.  To  do  this,  a  set  of 
time-modified  speech  tokens  was  created  and  used  in  studies  of  perception  in  both  informal 
and  formal  listening  tests.  The  formal  listening  test  was  chosen  to  be  similar  to  several  of 
the phonological studies described in Section 1.3.2. By closely approximating published
tests,  the  ability  of  the  system  to  create  high  quality  synthesized  test  tokens  was 
investigated. 

The  third  goal  was  to  compare  the  results  of  the  formal  listening  test  with  the  results 
of  similar  published  research.  Particular  attention  was  given  to  the  perception  of  initial 
consonants  in  single-syllable,  consonant-vowel  (CV)  and  consonant-vowel-consonant 
(CVC)  words  when  different  portions  of  the  initial  consonant  were  removed.  A 
comparison  of  previous  research  shows  that  in  each  study,  typically  only  one  portion  of  the 
initial  consonant  was  removed  or  modified.  For  example,  one  investigator  reduced  the 
duration  of  a  word-initial  consonant  by  preserving  the  beginning  portion  of  the  consonant 
(Cole  and  Cooper,  1975),  while  another  investigator  reduced  the  duration  of  a  word-initial 
consonant  by  preserving  the  end  portion  of  the  consonant  (Grimm,  1966).  One  of  the  goals 
in this study was to determine if there were significant changes in perception when various
portions  (i.e.  the  beginning,  middle,  or  end)  of  the  initial  consonants  were  modified. 
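
The distinction between removing the beginning, middle, or end of the initial consonant can be illustrated with a short Python sketch. It is not the system developed in this study, which operates on labeled segments and synthesizes its output; the function name and arguments are hypothetical, and the consonant is assumed to occupy the first samples of the word.

```python
import numpy as np

def shorten_consonant(word, consonant_len, remove_len, where):
    """Remove `remove_len` samples from one portion of the initial consonant.

    word          : 1-D array; the consonant is assumed to span word[:consonant_len]
    remove_len    : number of consonant samples to delete
    where         : "beginning", "middle", or "end" of the consonant
    """
    if not 0 <= remove_len <= consonant_len <= len(word):
        raise ValueError("inconsistent lengths")
    if where == "beginning":
        start = 0
    elif where == "middle":
        start = (consonant_len - remove_len) // 2
    elif where == "end":
        start = consonant_len - remove_len
    else:
        raise ValueError("where must be 'beginning', 'middle', or 'end'")
    # Splice the word back together around the deleted span.
    return np.concatenate([word[:start], word[start + remove_len:]])

# Sample indices stand in for a waveform so the deleted span is easy to see.
word = np.arange(20)
from_beginning = shorten_consonant(word, consonant_len=10, remove_len=4, where="beginning")
from_middle = shorten_consonant(word, consonant_len=10, remove_len=4, where="middle")
from_end = shorten_consonant(word, consonant_len=10, remove_len=4, where="end")
```

All three results have the same total duration, so any difference in perception among them would be attributable to which portion of the consonant was removed.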

1.6  General  System  Description 

This  section  presents  an  overview  of  the  system  that  is  used  for  time  modification. 
A block diagram of the entire system is shown in Figure 1-1. The three main stages of the
system are (1) the speech analysis, segmentation, and labeling stage, (2) the manual
correction stage (for optional correction of the segmentation and labeling results), and (3)
the segment time modification and synthesis stage. Both the natural speech input signal
and the synthesized speech output signal are sampled-data time-domain signals. The
sampling frequency is fixed at f_s = 10 kHz.

The first stage works automatically with no input from the user, other than the
sampled-data input signal. This stage divides the signal into pitch-synchronous frames
(pitch-asynchronous for unvoiced and silent speech) and performs a linear predictive
coding (LPC) analysis for each frame. The frames are then grouped into segments, and
each  segment  is  labeled  with  the  most  appropriate  phonemic  type  label  (i.e.  vowel, 
semivowel,  etc.).  This  entire  process  is  accomplished  by  a  series  of  software  programs  that 
extract  the  acoustic  features  from  the  signal  and  compare  the  relative  contribution  of  each 
feature  to  the  specific  speech  segment. 

The  second  stage  manually  corrects  the  automatic  segmentation  and  labeling 
results.  This  is  only  required  if  the  automatic  segmentation  and  labeling  stage  makes 
mistakes.  Determination  of  whether  or  not  a  mistake  has  been  made  is  left  to  the  discretion 
of  the  user.  In  this  stage,  a  set  of  software  programs  with  a  graphical  user  interface  (GUI) 
allows  the  user  to  display  and  graphically  edit  the  segment  boundaries  and  labels.  The  user 
adjusts  the  results  by  moving  sliders  and  pushing  buttons  (with  a  mouse)  on  the  computer 
display.

Figure 1-1.  Block diagram of the time modification system.

The third stage performs the actual time modification process. It also performs the
synthesis  of  the  resulting,  time-modified  speech.  This  stage  uses  a  set  of  software  programs 
with  a  graphical  user  interface  (GUI)  that  allows  the  user  to  graphically  specify  how  the 
speech  signal  is  modified.  Each  type  of  phoneme  has  its  own  modification  parameters,  and 
in  addition,  each  segment  can  also  have  its  own  modification  parameters  independent  of 
phoneme  type,  if  desired.  Once  the  parameters  are  all  specified,  the  third  stage  synthesizes 
the  time-modified  speech  using  an  LPC  speech  synthesizer. 

1.7  Chapter  Organization 

Chapter  2  details  the  automatic  speech  analysis,  segmentation,  and  segment 
labeling  programs  that  comprise  the  first  stage  in  the  time  modification  task.  It  also 
discusses  the  mistakes  made  by  the  automatic  programs,  and  describes  a  software  tool 
developed  to  manually  correct  any  mistakes  in  the  segmentation  and  labeling  results. 
Chapter  3  describes  the  time  modification  programs  and  the  associated  graphical  user 
interface.  Chapter  4  describes  the  development  and  results  for  both  the  formal  and  informal 
listening  tests.  Chapter  5  presents  a  discussion  of  the  formal  listening  test  results.  Chapter 
6  summarizes  the  study,  and  suggests  ideas  for  future  research. 


CHAPTER  2 
SPEECH  ANALYSIS,  SEGMENTATION,  AND  LABELING 


The  time  modification  system  in  this  study  varies  the  durations  of  selected 
segments  of  the  speech  signal.  Possible  segments  include  vowels,  nasals,  unvoiced 
fricatives,  etc.  Each  segment  duration  is  modified  according  to  parameters  specified  by 
the  user.  These  parameters  apply  to  either  all  occurrences  of  a  specific  type  of  segment, 
or  to  a  single  occurrence  of  a  segment.  One  example  of  a  parameter  that  applies  to  a  vowel 
is the vowel scale factor, SF_vowel. The vowel scale factor specifies the desired ratio of the
duration  of  the  vowel  segment(s)  in  the  time-modified  word  to  the  duration  of  the 
corresponding  vowel  segment(s)  in  the  original,  unmodified  word. 
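
As a small worked illustration of the scale factor definition, the sketch below computes the target duration of a segment from its original duration. The function name and the example durations are hypothetical and are not taken from the study.

```python
def scaled_duration_ms(original_ms: float, scale_factor: float) -> float:
    """Target duration of a segment after time modification.

    Uses the definition SF = (modified duration) / (original duration),
    so the modified duration is SF times the original duration.
    """
    if original_ms < 0 or scale_factor < 0:
        raise ValueError("durations and scale factors must be non-negative")
    return scale_factor * original_ms


# Hypothetical example: a 180 ms vowel with SF_vowel = 0.5 is shortened to 90 ms.
print(scaled_duration_ms(180.0, 0.5))
```

A scale factor below one shortens the segment, a scale factor above one lengthens it, and a scale factor of exactly one leaves the duration unchanged.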

Since  the  speech  segments  are  the  basis  of  the  time  modification  system,  it  is 
important  that  the  segments  are  accurately  detected  and  identified  (for  brevity,  the 
detection  and  identification  processes  will  hereafter  be  called  detection).  To  accomplish 
this  goal,  the  system  designer  is  faced  with  one  of  two  choices:  manual  detection  (by  hand), 
or  automatic  detection  (software).  While  manual  detection  provides  good  results,  it  is 
extremely  tedious  and  time  consuming.  This  limits  the  usefulness  of  the  overall  time 
modification  system.  Automatic  detection  is  quick  and  relatively  "painless,"  but  is  more 
prone  to  mistakes  than  the  manual  method  and  requires  significant  initial  development 
time. 

As  a  compromise,  this  study  uses  automatic  detection  of  the  speech  segments  with 
subsequent  manual  editing  to  correct  errors  that  may  occur  in  the  detection  process. 
Automatic  detection  consists  of  three  main  steps:  (1)  speech  analysis,  (2)  segmentation  of 
the  word  or  sentence  into  segments  of  unknown  type,  and  (3)  appropriate  labeling  of  these 
segments. The manual editing process allows the user to display and edit the automatic
segmentation and labeling results using a set of software programs with a convenient,
easy-to-use,  graphical  user  interface  (GUI).  The  GUI  allows  the  user  to  insert  silent 
segments,  change  segment  boundaries,  change  segment  labels,  and  merge  together 
adjacent,  like  segments,  all  with  the  click  of  a  workstation  mouse. 

This  chapter  describes  both  the  automatic  and  manual  algorithms  used  to  analyze, 
segment,  and  label  the  speech  segments.  It  begins  with  a  discussion  of  the  selection  of 
speech  segment  categories.  Next,  a  brief  overview  of  the  automatic  segment  detection 
process is presented with an example of automatic detection of the voiced/unvoiced/silent
(V/U/S)  parameter  for  the  word  "sue."  The  example  illustrates  the  methods  associated  with 
automatic  detection  of  a  single  parameter,  or  "feature,"  of  speech.  The  relevance  and 
application  of  these  methods  to  the  general  set  of  feature  detection  programs  is  described. 
Each  of  the  automatic  feature  detection  algorithms  is  then  detailed.  The  segmentation  and 
labeling processes are also presented in detail. The nature of the mistakes produced by the
automatic algorithms is discussed, and the manual editing system and corresponding GUI
are  described. 

2.1  Selection  of  Speech  Segment  Categories 

The  complexity  of  the  algorithms  for  segmentation  and  labeling  in  any  speech 
analysis  task  depends  upon  the  degree  of  recognition  that  must  be  achieved  (Davis  and 
Mermelstein,  1980).  For  a  given  speech  sound,  algorithms  that  determine  only  the 
phoneme  category  will  be  less  complicated  than  algorithms  that  determine  not  only  the 
category,  but  the  identity  of  the  phoneme  as  well.  Likewise,  for  a  given  speech  sound, 
algorithms  that  determine  the  allophonic  variation  of  a  particular  phoneme  can  be  expected 
to  be  complicated,  due  to  the  number  of  variations  of  the  pronunciation  of  a  single  phoneme 
that  can  occur  during  conversational  speech  (Klatt  and  Stevens,  1973).  Thus,  the 
complexity  of  the  segmentation  and  labeling  task  is  dependent  upon  both  the  number  and 
choice of categories used to subdivide the speech.

There are several possibilities for selecting speech segment categories. In a
top-down paradigm, the simplest choice is to classify speech as voiced, unvoiced, or silent
(V/U/S). The next, more complex choice is to classify each speech sound as a member of
one of the basic phoneme types. Although the exact description and number of different
phoneme categories vary slightly depending upon the school of thought, these categories
usually  include  vowels,  nasals,  semivowels,  voiced  fricatives,  voiced  stops,  unvoiced 
fricatives,  unvoiced  stops,  and  silence.  An  even  more  complex  categorization  requires 
identification  of  the  exact  phoneme.  This  requires  matching  the  segment  under 
consideration  with  one  of  the  47  phonemes  in  English  (Borden  and  Harris,  1984). 

For  this  study,  speech  is  divided  into  eight  segment  categories:  vowels, 
semivowels, nasals, voiced fricatives, voice bars, unvoiced stops, unvoiced fricatives, and
silence.  Overall,  this  choice  is  a  compromise  between  the  complexity  of  the  segment 
recognition  algorithms  and  the  resolution  of  the  resulting  speech  segments.  Since 
automatic  language  understanding  is  not  required  in  this  project,  recognition  of  individual 
phonemes is not required. This greatly reduces the complexity of the segment recognition
algorithms,  since  the  choices  in  the  matching  process  are  reduced  from  47  to  eight.  In 
addition,  it  is  easier  to  recognize  the  typically  large  differences  between  phoneme 
categories  than  it  is  to  recognize  smaller  differences  between  various  phonemes  of  the  same 
category  (Schwartz  and  Makhoul,  1975). 

2.2  Overview  of  Automatic  Segment  Detection 

Automatic detection of the speech segments is accomplished by a series of software
programs that sequentially analyze the speech signal. These programs are grouped
according to the task they perform in the detection process. The three main tasks are shown
in Figure 2-1 as: (1) speech analysis, (2) segmentation of the word or sentence into
segments of unknown type, and (3) labeling of these segments with the most appropriate
segment label. A brief overview of each of these tasks follows, and they are discussed in
detail in later sections of this chapter.

Figure 2-1.  Block diagram of automatic speech detection.

Speech  analysis  is  the  most  complicated  of  the  three  tasks,  and  is  divided  into 
several  steps.  A  block  diagram  of  the  speech  analysis  task  is  shown  in  Figure  2-2.  The 
initial  analysis  and  decomposition  of  the  speech  waveform  is  derived  from  a  two-pass 
method  developed  by  Hu  (1993).  In  the  first  pass,  the  sampled  waveform  is  divided 
asynchronously  into  5  ms  frames.  A  13th-order,  linear  predictive  coding  (LPC)  analysis 
is  performed  for  each  frame,  and  the  residue  is  processed  to  determine  the  glottal  closure 
points  (Hu,  1993).  Only  the  glottal  closure  indices  (GCI)  are  retained  for  use  in  the  second 
pass  of  the  algorithm.  In  the  second  pass,  the  sampled  waveform  is  again  divided  into 
frames.  The  frames  are  chosen  pitch  asynchronously  for  unvoiced  speech  and  silence,  and 
pitch  synchronously  for  voiced  speech,  using  the  glottal  closure  indices  as  a  reference.  A 
13th-order,  LPC  analysis  is  performed  for  each  frame,  and  the  residue,  LPC  coefficients, 
and  power  are  saved  for  later  modification  and  synthesis.  Next,  a  set  of  feature  detection 
algorithms  analyzes  each  frame  individually.  Each  algorithm  in  the  set  detects  a  different 
acoustic  feature.  For  example,  one  of  the  algorithms  detects  the  presence  or  absence  of 
nasals,  while  another  algorithm  detects  the  presence  or  absence  of  semivowels.  Each 
feature  detection  algorithm  uses  a  combination  of  fixed  thresholds,  median  filtering,  and 
empirical  rules  to  calculate  the  final  result,  or  "feature  score." 

The  second  automatic  detection  task  shown  in  Figure  2-1  is  determination  of  the 
time-domain  boundaries  that  separate  the  segments  of  the  speech  signal.  This  process  is 
known  as  segmentation.  The  boundaries  are  chosen  such  that  each  segment  has  relatively 
stable  acoustic  properties  for  the  duration  of  the  segment.  Segmentation  is  accomplished 
by  combining  the  results  of  two  different  algorithms.  The  first  algorithm  determines  the 
changes  in  the  "trend"  of  the  short-term  frequency  spectra,  and  the  second  uses  the  results 
of the voiced/unvoiced/silent (V/U/S) feature detector.

Figure 2-2.  Block diagram of speech analysis.

The third task in Figure 2-1 is labeling of the segments using the results obtained
from  the  feature  detection  algorithms.  Examples  of  labels  are  vowel,  semivowel,  and 
unvoiced  fricative.  Labeling  is  done  in  two  steps.  First,  the  average  feature  detection 
scores  are  calculated.  Next,  empirical  rules  are  applied  to  the  average  scores  to  determine 
the  most  appropriate  label  for  each  segment. 
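
The first labeling step (averaging the feature scores over each segment) can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout, the function names, and the unweighted per-frame average are hypothetical. The "highest average wins" rule is only a stand-in for the empirical labeling rules, which are not reproduced here.

```python
from statistics import mean


def average_feature_scores(frame_scores, boundaries):
    """Average each feature score over the frames of each segment.

    frame_scores: dict mapping a feature name to a list of per-frame scores.
    boundaries:   list of (start_frame, end_frame) pairs, end exclusive.
    Returns one dict of averaged scores per segment.
    """
    averaged = []
    for start, end in boundaries:
        averaged.append(
            {name: mean(scores[start:end]) for name, scores in frame_scores.items()}
        )
    return averaged


def label_by_highest_average(averaged):
    """Stand-in rule (an assumption): label each segment with its top-scoring feature."""
    return [max(segment, key=segment.get) for segment in averaged]


# Hypothetical scores for four frames grouped into two segments.
scores = {"vowel": [0.9, 0.8, 0.1, 0.0], "nasal": [0.1, 0.2, 0.7, 0.9]}
segments = [(0, 2), (2, 4)]
print(label_by_highest_average(average_feature_scores(scores, segments)))
```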

2.3  Feature  Detection  Algorithms — General  Development 

Acoustic  feature  detection  is  the  search  for  different  (acoustic)  features.  Examples 
of  acoustic  features  include  voicing,  nasality,  and  sonorance.  While  acoustic  features  are 
used  to  help  differentiate  between  the  various  segment  categories,  it  is  important  to  realize 
that  individual  acoustic  features  may  not  be  unique  to  one  particular  segment  category.  For 
example,  nasality  may  indicate  the  presence  of  a  nasal,  or  it  may  indicate  the  presence  of 
a  nasalized  vowel.  Thus,  in  this  example,  one  acoustic  feature  is  common  to  two  different 
segment  categories.  This  lack  of  one-to-one  correspondence  between  acoustic  features  and 
segment  categories  requires  that  multiple  acoustic  features  be  evaluated  and  weighed  when 
attempting  to  match  an  unknown  speech  segment  with  the  most  appropriate  segment  label. 

Although  it  is  logical  to  use  the  term  "segment  detector"  to  define  an  algorithm  that 
detects  one  of  the  eight  segment  types  listed  in  Section  2.1,  this  term  is  misleading,  since 
it  can  be  confused  with  the  previously  defined  definition  of  segmentation,  which  is  the  task 
of  dividing  the  speech  signal  into  segments  of  unknown  type.  Therefore,  in  this  study,  the 
term  "feature  detector"  is  used  in  a  broad  sense,  and  implies  both  an  algorithm  that  detects 
a single acoustic feature, as well as an algorithm that detects multiple acoustic features in
order  to  detect  one  of  the  eight  segment  types  listed  in  Section  2.1. 

In general terms, feature detection is achieved by algorithms that examine the
short-term  frequency  spectra  of  the  speech  signal.  The  spectra  are  calculated  from  the  LPC 
coefficients  that  are,  in  turn,  calculated  during  the  initial  analysis  stage  for  each  frame  of 
the signal. It has been shown that the short-term frequency spectra method is a reliable
technique used in a wide variety of recognition systems (Bush et al., 1983; Glass and Zue,
1986;  Glass  and  Zue,  1988;  Klatt,  1977;  Leung  et  al.,  1993;  McCandless,  1974;  Meng  and 
Zue,  1991;  Mermelstein,  1977;  Weinstein  et  al.,  1975;  Zue  et  al.,  1989). 

Each  feature  detection  algorithm  utilizes  a  sequence  of  processing  stages  to 
calculate  the  resulting  feature  score.  In  many  instances,  the  structure  of  each  of  the  feature 
detection  algorithms  is  similar,  although  the  exact  numerical  values  may  differ. 

The  detection  of  acoustic  features  from  the  speech  signal  is  the  most  complicated 
portion  of  the  analysis,  segmentation,  and  labeling  process.  Because  of  the  complexity  of 
the  feature  detection  algorithms,  the  explanation  of  the  algorithms  is  broken  down  into  two 
sections.  In  this  section,  a  simple  example  is  given  to  explain  one  feature  detection 
algorithm  and  its  development  and  implementation.  Although  the  example  is  for  a  single 
feature  detector,  it  illustrates  the  general  structure  of  the  majority  of  the  feature  detectors. 
The  example  also  discusses  the  problems  and  considerations  associated  with  the  set  of 
feature  detection  algorithms  as  a  whole.  In  the  next  section,  the  algorithms  are  detailed 
individually,  examining  the  specific  equations  of  the  algorithms. 

The  feature  detection  algorithms  use  a  combination  of  methods  to  produce  the  final 
results. These methods include bandpass filters, fixed thresholds, median filter smoothing,
and  empirical  pattern  recognition  rules.  These  methods  are  used  in  a  similar  manner  in  each 
of  the  algorithms.  The  example  that  follows  illustrates  how  these  methods  work  in  a  feature 
detection  algorithm  that  detects  the  voiced  /  unvoiced  /  silence  (V/U/S)  feature  of  speech. 

2.3.1  Input  Data  and  V/U/S  Pre-Processing 

All  of  the  feature  detection  algorithms  require  the  LPC  results  as  input  data.  Most 
of the feature detectors also require the results from other feature detectors (specifically
the V/U/S results), as shown in Figure 2-2.

V/U/S classification is different from the other feature detection algorithms in that
a portion of the algorithm is accomplished during the initial LPC analysis algorithm.
During the first pass of the LPC algorithm, the first reflection coefficient is calculated for
each pitch-asynchronous frame. The frame is classified as voiced (V) if the reflection
coefficient is greater than 0.2, and is classified as unvoiced (U) if the reflection coefficient
is less than or equal to 0.2. This threshold was determined empirically by Hu (1993). In
addition, Hu makes no distinction between unvoiced and silent frames. Therefore, all silent
frames are classified as unvoiced. During the second pass of the LPC algorithm, certain
frames are labeled as transitional frames (T). The first voiced frame in an unvoiced-voiced
sequence, and the last voiced frame in a voiced-unvoiced sequence are changed to
transitional frames. Hu's explanation for this is that mistakes may be made in the simple
V/U decision process, so the frames at the transition regions are marked since this is
typically where the mistakes are made. Since it has been observed that the transition frames
are always voiced, all transition frames are converted to voiced frames in this study.
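
The pre-processing described above can be sketched in a few lines. This is an illustration only, not Hu's implementation: it assumes the first reflection coefficient is taken as the normalized lag-1 autocorrelation r(1)/r(0), with a sign convention in which voiced (strongly low-pass) frames give positive values. All function names are hypothetical.

```python
import math


def first_reflection_coefficient(frame):
    """Normalized lag-1 autocorrelation r(1)/r(0) of one frame (assumed convention)."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0  # all-zero frame: falls below the threshold, so non-voiced
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    return r1 / r0


def classify_vu(frame, threshold=0.2):
    """First-pass decision: voiced if the coefficient exceeds 0.2, else unvoiced."""
    return "V" if first_reflection_coefficient(frame) > threshold else "U"


def mark_transitions(labels):
    """Second pass: the first voiced frame after U and the last voiced frame before U become T."""
    marked = list(labels)
    for i, label in enumerate(labels):
        if label != "V":
            continue
        after_unvoiced = i > 0 and labels[i - 1] == "U"
        before_unvoiced = i + 1 < len(labels) and labels[i + 1] == "U"
        if after_unvoiced or before_unvoiced:
            marked[i] = "T"
    return marked


def resolve_transitions(labels):
    """Treatment used in this study: every transitional frame is converted to voiced."""
    return ["V" if label == "T" else label for label in labels]


# A slowly varying 50-sample frame (100 Hz tone at 10 kHz) versus a rapidly alternating frame.
slow = [math.sin(2 * math.pi * 100 * n / 10000) for n in range(50)]
fast = [1.0, -1.0] * 25
print(classify_vu(slow), classify_vu(fast))
print(mark_transitions(["U", "V", "V", "V", "U"]))
```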

2.3.2  Volume  Function 

A  volume  function,  V(i),  similar  to  one  presented  by  Weinstein  et  al.  (1975)  is 
calculated  for  each  frame  to  determine  a  quantity  analogous  to  the  loudness,  or  acoustic 
volume,  of  the  signal  at  the  output  of  a  hypothetical  bandpass  filter.  This  is  the  first 
processing  step  in  the  majority  of  feature  detectors.  The  volume  function  is  normalized 
by  the  number  of  samples  in  the  frame,  and  is  given  by 


    V(i) = \frac{1}{N_i} \sqrt{ \sum_{m=A}^{B} \left| H_i(e^{j\pi m/256}) \right|^2 }    (2.1)


where i is the current frame index, N_i is the number of samples in frame i, A is the index
of the low cutoff frequency of the bandpass filter, B is the index of the high cutoff frequency
of the bandpass filter, and H_i(e^{j\pi m/256}) is the complex, single-sided, frequency response of the
IIR filter, H_i(z), produced by the LPC coefficients and evaluated at the points
exp(j\pi m/256), for 0 ≤ m ≤ 255. H_i(z) is given by

    H_i(z) = \frac{G(i)}{a_0 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_N z^{-N}}    (2.2)

where N = 13, a_0 = 1, and G(i) is given by

    G(i) = \sum_{n=s}^{t} r^2(n)    (2.3)


where  r(n)  is  the  value  of  the  LPC  residue  at  sample  n,  i  is  the  current  frame  index,  s  is  the 
beginning  sample  number  of  the  current  frame,  and  t  is  the  ending  sample  number  of  the 
current  frame. 
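
A minimal sketch of how the volume function might be evaluated from the LPC coefficients of one frame is given below. It assumes Equation 2.1 has the form V(i) = (1/N_i) * sqrt(sum over m = A..B of |H_i|^2), with H_i(z) = G(i) / (a_0 + a_1 z^-1 + ... + a_N z^-N) evaluated at z = exp(j*pi*m/256). The function and argument names are hypothetical.

```python
import cmath
import math


def lpc_frequency_response(a, gain, m, n_points=256):
    """H_i at z = exp(j*pi*m/n_points) for H(z) = gain / sum_k a[k] * z**(-k)."""
    omega = math.pi * m / n_points
    denominator = sum(a_k * cmath.exp(-1j * omega * k) for k, a_k in enumerate(a))
    return gain / denominator


def volume_function(a, gain, n_samples, low_index, high_index):
    """Band-limited volume of one frame, normalized by the frame length N_i.

    a:          LPC coefficients [a_0, a_1, ..., a_N] with a_0 = 1
    gain:       G(i) for the frame
    n_samples:  N_i, the number of samples in the frame
    low_index:  A, index of the low cutoff frequency
    high_index: B, index of the high cutoff frequency
    """
    energy = sum(
        abs(lpc_frequency_response(a, gain, m)) ** 2
        for m in range(low_index, high_index + 1)
    )
    return math.sqrt(energy) / n_samples


# Hypothetical flat-spectrum frame (no predictor taps) with the V/U/S band limits A = 17, B = 255.
v = volume_function([1.0], gain=2.0, n_samples=50, low_index=17, high_index=255)
print(20.0 * math.log10(v))  # volume expressed in dB
```

A ratio of two volume functions is obtained by calling `volume_function` twice with different band limits and dividing the results.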

The volume function of Equation 2.1 is used extensively in this study, although the
frequency  range  of  the  bandpass  filter  varies  depending  upon  the  specific  detector.  In 
addition,  many  of  the  feature  detection  algorithms  calculate  the  ratio  of  two  volume 
functions,  each  with  its  own  frequency  range.  This  compares  the  energy  in  one  frequency 
band  to  the  energy  in  a  second  frequency  band. 

In  the  majority  of  feature  detectors,  median  filtering  is  done  to  smooth  any  large, 
short-term,  fluctuations  in  the  volume  function.  The  fluctuations  are  caused  by  a  variety 
of sources including incorrect GCI determination, incorrect V/U/S classification, and
recording  artifacts  such  as  tape  hiss  and  background  noise.  Although  the  majority  of  the 
feature  detectors  use  a  5th-order  median  filter  for  smoothing,  the  exact  order  is  given  in 
the  detailed  description  of  each  detector.  The  filter  order  is  determined  empirically  in  each 
case. 
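
The smoothing step can be illustrated with a short running-median sketch. The edge handling (shrinking the window at the ends of the track) is an assumption, since the text does not state how the first and last frames are treated. The function name is hypothetical.

```python
from statistics import median


def median_filter(values, order=5):
    """Running median of odd order applied to a per-frame track.

    Near the ends, the window is shortened to the available frames (an
    assumption); an even-length edge window averages its two middle values.
    """
    if order < 1 or order % 2 == 0:
        raise ValueError("order must be a positive odd integer")
    half = order // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(median(window))
    return smoothed


# A single-frame spike in an otherwise flat volume track is removed.
print(median_filter([1, 1, 1, 9, 1, 1, 1]))
```

Unlike a moving average, the median discards an isolated outlier entirely instead of spreading it over neighboring frames, which suits the short-term fluctuations described above.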

The  V/U/S  detector  uses  a  single  volume  function  of  Equation  2.1  with  the  values 
A  =  17  and  B  =  255.  The  lower  limit  of  A  =  17  serves  to  highpass  (HP)  filter  the 
frequency response with a cutoff frequency of 312 Hz (the upper limit B = 255
corresponds to one-half the sampling rate, thus a highpass instead of a bandpass filter).
Weinstein  claims  that  the  HP  filter  is  needed  to  reduce  the  sensitivity  to  voiced  stops,  but 
experiments  show  that  its  primary  effect  is  to  reduce  low  frequency  artifacts  such  as  wind 
noise  and  other  pop-like  sounds  caused  by  non-optimum  microphone  placement  during  the 
recording  process.  In  general,  the  volume  function  is  used  in  the  V/U/S  detector  as  a 
relatively  wide-band  integrator  that  calculates  the  approximate  energy  in  each  frame.  The 
role  of  this  integrator  is  discussed  in  the  next  section. 

A  graph  of  the  frequency  response  of  Hi(z)  before  and  after  the  hypothetical 
highpass  filter  for  one  pitch  period  of  the  vowel  portion  of  the  word  "sue"  spoken  by  a  male 
speaker  is  shown  in  Figure  2-3.  As  described  in  Equation  2.1,  the  single  sided  frequency 
response  of  Hi(z)  is  evaluated  at  256  equally  spaced  points  around  the  upper  half  of  the  unit 
circle  in  the  z-plane.  Although  it  is  not  shown  on  the  graph,  the  value  of  V(i)  for  the  pitch 
period  analyzed  in  Figure  2-3b  is  70.62  dB.  It  is  seen  from  the  graph  that  this  is 
approximately  the  average  level  of  the  frequency  response. 

Note also that unlike the other feature detectors, median filtering is not performed
on the V/U/S volume function. This is to ensure that any short-term energy fluctuations,
such as those produced by stops, are not inadvertently smoothed.

2.3.3  Fixed  Thresholds  and  Feature  Scores 

Each feature detection algorithm calculates a feature score to indicate the presence
of the corresponding acoustic feature in a given frame of speech. The feature score is
typically continuous over the range [0,1], although there are exceptions (several of the
feature scores are discrete, either binary or trinary). In general, the feature score is
calculated by comparing the value of the volume function (or the ratio of two volume
functions) with one or more fixed thresholds. The values of the thresholds are determined
empirically by trial and error during the analysis of approximately 100 words of the
Diagnostic Rhyme Test (DRT) spoken by two male and one female speakers (Voiers, 1983).
In some cases, initial estimates for the thresholds are taken from the literature (sources are
listed in the discussion of the specific algorithms), and the thresholds are then "fine tuned"
using the DRT speech data.

Figure 2-3.  H_i(z) used to calculate the V/U/S volume function for
one pitch period of the vowel portion of the word "sue."
a) H_i(z) before filtering; b) H_i(z) after filtering.

The  empirical  determination  of  these  thresholds  constitutes  a  type  of  "learning" 
phase in the algorithm development. This contrasts with one of the initial goals of the automatic
segmentation and labeling process, which is to not require "training" of the algorithms.
However, given the nature and variability of the speech signal, it now seems impossible
to  create  a  set  of  reliable  segmentation  and  labeling  algorithms  based  upon  frequency 
distributions  (i.e.  volume  functions)  without  some  type  of  training,  or  parameter  "tuning." 

The advantages and disadvantages of training are obvious. If the training data does
not accurately represent the set of intended users, the algorithms will not function as
expected in practice. If the training data completely represents the set of intended users,
the algorithms will work efficiently and accurately. Since the topic of training of speech
recognition algorithms is beyond the scope of this dissertation, it will be accepted that
training  is  mandatory,  regardless  of  the  particular  algorithm. 

In general, if two thresholds are used, the feature score for each frame is determined
by

    Feature_Score(i) = \begin{cases}
        1, & \text{if } Vol\_Fcn(i) \ge T_{upper} \\
        0, & \text{if } Vol\_Fcn(i) < T_{lower} \\
        \dfrac{Vol\_Fcn(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le Vol\_Fcn(i) < T_{upper}
    \end{cases}    (2.4)


for a feature score that increases as the volume function increases. If the feature score
decreases as the volume function increases, the feature score is given by

    Feature_Score(i) = \begin{cases}
        0, & \text{if } Vol\_Fcn(i) \ge T_{upper} \\
        1, & \text{if } Vol\_Fcn(i) < T_{lower} \\
        \dfrac{T_{upper} - Vol\_Fcn(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le Vol\_Fcn(i) < T_{upper}
    \end{cases}    (2.5)


For both equations, i is the current frame index, T_lower is the fixed lower threshold, T_upper
is the fixed upper threshold, and Vol_Fcn(i) is the volume function (or the ratio of two
volume functions) for the current frame. Both Equation 2.4 and Equation 2.5 are used in
practice. If only one threshold is used to calculate a binary feature score (zero or one), then
either Equation 2.4 or Equation 2.5 is used with T_lower = T_upper.
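
The two-threshold mapping of Equations 2.4 and 2.5 can be sketched as a single function. This is a minimal illustration: the function name and the `increasing` flag are hypothetical, and the boundary cases (greater than or equal to T_upper, strictly less than T_lower) are assumed so that the three cases cover every input exactly once.

```python
def feature_score(vol, t_lower, t_upper, increasing=True):
    """Map a volume function (or ratio of two) onto a score in [0, 1].

    increasing=True  follows the form of Equation 2.4 (score rises with vol).
    increasing=False follows the form of Equation 2.5 (score falls with vol).
    With t_lower == t_upper the score is binary and the linear branch is never reached.
    """
    if t_lower > t_upper:
        raise ValueError("t_lower must not exceed t_upper")
    if vol >= t_upper:
        score = 1.0
    elif vol < t_lower:
        score = 0.0
    else:
        score = (vol - t_lower) / (t_upper - t_lower)
    return score if increasing else 1.0 - score


print(feature_score(5.0, 0.0, 10.0))                     # midway between the thresholds
print(feature_score(2.5, 0.0, 10.0, increasing=False))   # decreasing form
print(feature_score(3.0, 3.0, 3.0))                      # single threshold, binary score
```

Writing the decreasing form as one minus the increasing form keeps the two equations consistent at the thresholds, since 1 - (Vol_Fcn - T_lower)/(T_upper - T_lower) equals (T_upper - Vol_Fcn)/(T_upper - T_lower).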

The  original  LPC  analysis  described  in  Section  2.3.1  distinguishes  only  between 
voiced and non-voiced frames. The non-voiced frames denoted by Hu (1993) as
"unvoiced"  are  either  unvoiced  or  silent.  The  procedure  used  in  the  V/U/S  detector  to 
classify  non-voiced  frames  as  either  unvoiced  or  silent  is  based  upon  a  single  volume 
function,  the  background  noise  power  in  the  speech  signal,  and  Equation  2.4.  First,  the 
mean  and  the  standard  deviation  of  the  background  noise  power,  BNP,  are  calculated  as 


    BNP_{mean} = \frac{1}{20} \sum_{n=1}^{20} p(n)    (2.6)


    BNP_{stddev} = \sqrt{ \frac{1}{20} \sum_{n=1}^{20} \left( p(n) - BNP_{mean} \right)^2 }    (2.7)


where  p(n)  is  the  frame  power  in  decibels  (dB).  It  is  assumed  that  the  first  100  ms  (20 
frames)  of  the  speech  signal  are  silence. 

The  V/U/S  volume  function  for  each  non-voiced  frame  is  then  compared  to  a 
constant threshold, T_U/S, using Equation 2.4 with T_lower = T_upper and T_upper = T_U/S.
The V/U/S feature score for each non-voiced frame is given as

    VUS_Score(i) = \begin{cases}
        1, & \text{if } 20 \log_{10}\{V(i)\} \ge T_{U/S} \\
        0, & \text{if } 20 \log_{10}\{V(i)\} < T_{U/S}
    \end{cases}    (2.8)


where  V(i)  is  calculated  from  Equation  2.1  with  A  =  17  and  B  =  255,  and  i  is  the  index 
of  the  current  frame.  If  VUS_Score(i)  =  1,  the  frame  is  classified  as  unvoiced,  and  if 
VUS_Score(i)  =  0,  the  frame  is  classified  as  silent.  Note  that  this  method  only  separates 
unvoiced  from  silent  frames.  The  value  of  VUS_Score(i)  is  arbitrarily  set  equal  to  two  for 
all voiced frames. As a result, the V/U/S feature score is different from many of the other
feature  scores  in  two  ways:  First,  it  spans  the  range  [0,2]  instead  of  [0,1].  Second,  it  can 
have  only  one  of  three  discrete  values,  while  most  of  the  other  feature  scores  are 
continuous. 

It  is  found  (empirically)  that  the  best  results  are  obtained  when 

    T_{U/S} = BNP_{mean} + k \cdot BNP_{stddev}    (2.9)

where k = 2.0. Obviously, the value used for k is dependent upon the statistical properties
of the background noise in the speech signal. However, the absolute level of the
background  noise  is  compensated  for  automatically  since  the  value  of  BNPmean  is 
calculated  before  analysis  of  each  word. 
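
The unvoiced/silent decision can be sketched as follows. This is an illustration under stated assumptions: the standard deviation is computed in its population form (division by the number of frames), a non-positive volume is treated as silence to avoid taking the logarithm of zero, and all function names are hypothetical.

```python
import math


def background_noise_stats(frame_power_db, n_frames=20):
    """Mean and standard deviation (dB) of the leading frames, assumed to be silence."""
    lead = frame_power_db[:n_frames]
    if len(lead) < n_frames:
        raise ValueError("not enough leading frames to estimate the background noise")
    mean = sum(lead) / n_frames
    variance = sum((p - mean) ** 2 for p in lead) / n_frames  # population form (assumption)
    return mean, math.sqrt(variance)


def us_threshold(frame_power_db, k=2.0):
    """T_U/S = BNP_mean + k * BNP_stddev, with k = 2.0 as reported in the text."""
    mean, stddev = background_noise_stats(frame_power_db)
    return mean + k * stddev


def vus_score(volume, is_voiced, threshold_db):
    """2 for voiced frames, 1 for unvoiced frames, 0 for silent frames."""
    if is_voiced:
        return 2
    if volume <= 0.0:
        return 0  # assumption: zero volume is silence
    return 1 if 20.0 * math.log10(volume) >= threshold_db else 0


# Hypothetical leading-frame powers alternating between 10 dB and 14 dB.
powers = [10.0, 14.0] * 10
threshold = us_threshold(powers)
print(threshold, vus_score(10.0, False, threshold), vus_score(1.0, False, threshold))
```

Because the threshold is re-estimated from the leading frames of each recording, a change in the absolute noise level shifts the threshold with it, which is the automatic compensation described above.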

Figure 2-4 shows the V/U/S classification for the word "sue" spoken by a male
speaker: a) shows the time-domain speech waveform, b) shows the volume function (in
dB) from Equation 2.1 and the fixed threshold, T_U/S, and c) shows the resulting V/U/S
score.

2.3.4  Automatic  Correction  Rules 

The feature scores produced by Equation 2.4 (or Equation 2.5) sometimes require
additional processing to help eliminate false detection of features. The processing is
accomplished by the application of pattern recognition rules. These rules are used
sparingly, and counteract specific, regularly occurring, incorrect classifications of feature
types. They are developed empirically on an algorithm-by-algorithm basis.

Figure 2-4.  a) Time-domain waveform; b) Volume function and T_U/S threshold
(threshold shown as dashed line); c) V/U/S score.

In the case of the V/U/S detector, the unprocessed, or "raw," V/U/S score produced
by Equation 2.8 has the undesirable effect of sometimes oscillating between states during
word onsets and offsets, as well as during transitions between unvoiced and voiced regions.
Since this does not accurately model the human speech process, rules are applied to smooth
the V/U/S score, or "track." Table 2-1 lists the rules.


Table 2-1. Rules to modify initial voiced/unvoiced/silent (V/U/S) results.
The symbols x and y denote any segment type.

Rule     Initial Pattern         Requirements                Final Pattern
Number                           for Modification
------   ---------------------   -------------------------   -----------------
1        V U S                   length V > 100.0 ms,        V V S
                                 length U < 25.1 ms
2        x S y                   length S < 10.1 ms          x x y
3        S U V                   length U < 7.5 ms           S S V
4        x U y (except S U V)    length U < 10.0 ms          x y y (if x = S),
                                                             else x x y

The  first  rule  eliminates  an  incorrect  unvoiced  classification  at  the  end  of  a  long 
voiced  segment.  Since  the  energy  often  decreases  quickly  during  the  last  few  glottal  cycles 
of  a  voiced-silent  transition,  the  classification  algorithm  sometimes  (incorrectly)  labels 
these  pitch  periods  as  unvoiced.  The  second  rule  smooths  out  momentary  "drop  outs"  that 
occur when the signal level drops below the T_U/S threshold for a brief period of time. The
third rule is similar to the first rule, except that it smooths out the beginning of the segment,
instead of the end. It reclassifies the very short, low energy frame at the beginning of a
silence-unvoiced-voiced transition from unvoiced to silent. The fourth rule smooths out
momentary  unvoiced  segments  of  very  short  duration.  This  rule  does  not  eliminate  the 
short  noise  bursts  exhibited  by  plosives,  since  it  only  acts  if  the  unvoiced  segment  is  less 
than  10  ms,  which  is  far  shorter  than  the  average  duration  of  the  plosive  burst  (Klatt,  1979; 
Umeda,  1977). 
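
The rules of Table 2-1 can be sketched as a small run-length post-processor. The following is only an illustrative sketch, not the implementation used in this study: it assumes the V/U/S track has already been collapsed into (label, duration in ms) segments, that the rules are applied in numbered order, and that the scan restarts after every change. The function names are hypothetical.

```python
def merge(segs):
    """Merge adjacent segments that carry the same label."""
    out = []
    for label, dur in segs:
        if out and out[-1][0] == label:
            out[-1] = (label, out[-1][1] + dur)
        else:
            out.append((label, dur))
    return out


def relabel(prev, mid, nxt, rule):
    """Return the new label for the middle segment, or None if the rule does not fire."""
    (xl, xd), (ml, md), (yl, _) = prev, mid, nxt
    if rule == 1 and (xl, ml, yl) == ("V", "U", "S") and xd > 100.0 and md < 25.1:
        return "V"                        # V U S -> V V S
    if rule == 2 and ml == "S" and md < 10.1:
        return xl                         # x S y -> x x y
    if rule == 3 and (xl, ml, yl) == ("S", "U", "V") and md < 7.5:
        return "S"                        # S U V -> S S V
    if rule == 4 and ml == "U" and (xl, yl) != ("S", "V") and md < 10.0:
        return yl if xl == "S" else xl    # x U y -> x y y if x = S, else x x y
    return None


def smooth_vus(segs):
    """Apply rules 1 to 4 in order (assumed order) to a list of (label, ms) segments."""
    segs = merge(segs)
    for rule in (1, 2, 3, 4):
        i = 1
        while i < len(segs) - 1:
            new = relabel(segs[i - 1], segs[i], segs[i + 1], rule)
            if new is None:
                i += 1
            else:
                segs[i] = (new, segs[i][1])
                segs = merge(segs)        # every firing removes at least one segment
                i = 1                     # rescan from the start (assumption)
    return segs
```

For example, the track S(50) U(5) V(120) U(20) S(8) U(4) S(60), with durations in milliseconds, collapses to S(55) V(152) S(60): rule 1 absorbs the 20 ms unvoiced tail, rule 2 removes the 8 ms dropout, rule 3 relabels the 5 ms onset, and rule 4 removes the remaining 4 ms unvoiced blip.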

Figure 2-5 shows the V/U/S score for the word "sue" spoken by a male speaker
before and after the application of the pattern recognition rules. The rules reduce the
number of non-silent segments from five to two.

2.3.5  Summary 

Each feature detection algorithm consists of a similar sequence of processing
stages. In general, the first stage calculates one or more volume functions. The volume
function (or ratio of volume functions) is smoothed by a median filter to remove any
short-term fluctuations. The second stage calculates a feature score, typically over the
range [0,1], by comparing the volume function with one or two fixed thresholds. The third
stage applies pattern recognition rules to correct for any known deficiencies in the
algorithms.

The following section details the individual feature detection algorithms. Each of
the detectors in Figure 2-2 is described, and any differences from the V/U/S example
detector are discussed.

2.4  Feature  Detection  Algorithms— Detailed  Descriptions 

This section gives the details of the feature detection algorithms. The algorithms
are, for the most part, similar in form to the V/U/S feature detection algorithms described
in Section 2.3.

Figure 2-5. Voiced / unvoiced / silent (V/U/S) classification for the word "sue."
Segment durations are indicated in number of samples.

a) Time-domain waveform;

b) Before pattern recognition rules;

c) After pattern recognition rules.

2.4.1 Sonorant Detection

To be classified as sonorant, a frame must be voiced, and must also have a high ratio
of low frequency to high frequency energy. The group of sonorants typically includes
vowels, voice bars, nasals, and semivowels. The non-sonorants include unvoiced
fricatives, unvoiced stops, and strong, voiced fricatives. Weak voiced fricatives are
classified as sonorant if they have a relatively large proportion of low-frequency energy
(Weinstein et al., 1975).

The volume function from Equation 2.1 is calculated for each frame with G = 1,
A = 5, and B = 46. This is termed the low frequency volume function, or LFV. The LFV
is equivalent to a bandpass filter from 98 Hz to 898 Hz. A second volume function from
Equation 2.1 is calculated for each frame with G = 1, A = 189, and B = 255. This is
termed the high frequency volume function, or HFV. The HFV is equivalent to a bandpass
filter from 3691 Hz to 5000 Hz. The sonorant ratio, R(i), is calculated for each frame as

$$R(i) = \frac{LFV(i)}{HFV(i)} \tag{2.10}$$

where  i  is  the  index  of  the  current  frame. 

The sonorant ratio is then smoothed by a fifth-order median filter. The smoothed
sonorant ratio is compared to a threshold, T_son, and a binary (zero or one) sonorant score,
SS(i), is calculated for each frame as

$$SS(i) = \begin{cases} 0, & \text{if } R(i) < T_{son} \\ 1, & \text{if } R(i) \ge T_{son} \end{cases} \tag{2.11}$$

where i is the frame index, and T_son = 10. The threshold T_son is determined empirically.

Figure 2-6 shows the sonorant detection results for the word "sue" spoken by a
male speaker. The threshold T_son = 10 is shown as a dashed line in Figure 2-6b. Note
that the sonorant ratio is nearly zero for the entire duration of the /s/.
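
The sonorant detector can be illustrated with a short sketch. It assumes the per-frame LFV and HFV values have already been computed from Equation 2.1 (not reproduced here), that the 256 single-sided spectral points are spaced fs/512 apart (which reproduces the band edges quoted above), and that the median window simply shrinks at the ends of the track. The edge handling and the function names are assumptions, not details taken from the text.

```python
from statistics import median

FS = 10_000.0   # sampling rate in Hz, from the text
N_FFT = 512     # assumed: 256 single-sided points spaced FS / 512 apart


def bin_to_hz(m):
    """Frequency of spectral point m under the assumed spacing."""
    return m * FS / N_FFT


def median_filter(x, order=5):
    """Running median; the window is truncated at the track edges (assumption)."""
    half = order // 2
    return [median(x[max(0, i - half):i + half + 1]) for i in range(len(x))]


def sonorant_score(lfv, hfv, t_son=10.0):
    """Binary sonorant score per frame, following Equations 2.10 and 2.11."""
    ratio = [low / high for low, high in zip(lfv, hfv)]
    smoothed = median_filter(ratio, 5)
    return [1 if r >= t_son else 0 for r in smoothed]
```

Under these assumptions, points 5 and 46 map to roughly 98 Hz and 898 Hz, and point 189 maps to roughly 3691 Hz, matching the LFV and HFV band edges. The median filter also removes an isolated one-frame spike in the ratio before the threshold is applied.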

Figure 2-6. Sonorant detection for the word "sue."

a) Time-domain waveform;

b) Sonorant ratio and threshold;

c) Sonorant score.

2.4.2 Vowel Detection

Vowel detection is accomplished in a manner similar to that for sonorant detection.
An LFV function from Equation 2.1 is calculated with G = 1, A = 1, and B = 51. The
LFV is equivalent to a bandpass filter from 20 Hz to 996 Hz. An HFV function from
Equation 2.1 is calculated for G = 1, A = 52, and B = 255. The HFV is equivalent to
a bandpass filter from 1016 Hz to 5000 Hz. A vowel ratio, VWL(i), is calculated for each
frame by


$$VWL(i) = \frac{LFV(i)}{HFV(i)} \tag{2.12}$$


where  i  is  the  frame  index.  The  vowel  ratio  is  then  smoothed  with  a  fifth-order  median 
filter.  A  vowel  score,  VWLS(i),  within  the  continuous  range  [0,1]  is  calculated  for  each 
frame  by  comparing  the  smoothed  vowel  ratio  with  two  thresholds.  The  score  is  given  by 


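$$VWLS(i) = \begin{cases} 0, & \text{if } VWL(i) > T_{upper} \\ 1, & \text{if } VWL(i) < T_{lower} \\ \dfrac{T_{upper} - VWL(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VWL(i) \le T_{upper} \end{cases} \tag{2.13}$$

where T_upper = 18 and T_lower = 8. The two thresholds are determined empirically.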

In  a  final  processing  stage,  the  vowel  score  is  automatically  set  to  zero  for  all  frames 
in  any  vowel  segment  that  is  150  samples  (15  ms)  or  less  in  length.  This  helps  to  reduce 
false  vowel  detection. 
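
A minimal sketch of the two-threshold score and the short-segment rule follows. It assumes a frame step of 50 samples (5 ms), so that 15 ms corresponds to three frames, and it treats a "vowel segment" as a run of consecutive frames with a non-zero score. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def vowel_score(ratio, t_lower=8.0, t_upper=18.0):
    """Map a smoothed vowel ratio to a score in [0, 1], as in Equation 2.13."""
    if ratio > t_upper:
        return 0.0
    if ratio < t_lower:
        return 1.0
    return (t_upper - ratio) / (t_upper - t_lower)


def drop_short_segments(scores, max_frames=3):
    """Zero every run of non-zero scores that is max_frames long or shorter."""
    out = list(scores)
    start = None
    for i, s in enumerate(out + [0.0]):        # trailing 0.0 closes a final run
        if s > 0.0 and start is None:
            start = i
        elif s <= 0.0 and start is not None:
            if i - start <= max_frames:
                out[start:i] = [0.0] * (i - start)
            start = None
    return out
```

With the thresholds above, a ratio of 13 lies halfway between T_lower and T_upper and yields a score of 0.5. A two-frame run of vowel scores is removed by the second function, while a five-frame run is kept.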

Figure 2-7 shows the vowel detection results for the word "said" spoken by a male
speaker. The two thresholds are shown as dashed lines in Figure 2-7b. Note that the voiced
/d/ exhibits a high vowel score for a short duration. Since there is no voiced stop segment
category in this study, voiced stops are typically classified as either a vowel-unvoiced stop
sequence, or a vowel-unvoiced fricative sequence.

Figure 2-7. Vowel detection for the word "said."

a) Time-domain waveform;

b) Vowel ratio and thresholds;

c) Vowel score.

2.4.3  Voiced  Consonant  Detection 

Voiced consonant detection is accomplished in a manner almost identical to vowel
detection. An LFV function from Equation 2.1 is calculated with G = 1, A = 1, and
B = 51. The LFV is equivalent to a bandpass filter from 20 Hz to 996 Hz. An HFV function
from Equation 2.1 is calculated for G = 1, A = 52, and B = 255. The HFV is equivalent
to a bandpass filter from 1016 Hz to 5000 Hz. These filter values are the same as those used
for vowel detection. A voiced consonant ratio, VC(i), is calculated for each frame as


$$VC(i) = \frac{LFV(i)}{HFV(i)} \tag{2.14}$$


where  i  is  the  frame  index.  The  voiced  consonant  ratio  is  then  smoothed  with  a  fifth-order 
median  filter.  A  voiced  consonant  score,  VCS(i),  within  the  continuous  range  [0,1]  is 
calculated  for  each  frame  by  comparing  the  smoothed  voiced  consonant  ratio  with  two 
thresholds.  The  score  is  given  by 


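$$VCS(i) = \begin{cases} 1, & \text{if } VC(i) > T_{upper} \\ 0, & \text{if } VC(i) < T_{lower} \\ \dfrac{VC(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VC(i) \le T_{upper} \end{cases} \tag{2.15}$$

where T_upper = 18 and T_lower = 8. The thresholds are determined empirically.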

Note  that  VCS  can  be  calculated  during  the  VWLS  calculation  for  each  frame,  since 
VCS  =  1  -  VWLS,  provided  that  the  value  of  VWLS  is  used  before  the  short  segment 
(<  15  ms)  vowel  detection  and  elimination  of  Section  2.4.2  is  done.  This  is  because  the 
same  filters  and  thresholds  are  used  for  both  vowel  and  voiced  consonant  detection. 
However, calculating VCS directly from VWLS eliminates the possibility of future
experiments  with  the  filter  characteristics  and  thresholds  for  voiced  consonant  detection 
independent  of  the  vowel  detection  algorithm. 
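
The relation VCS = 1 - VWLS can be checked case by case. The short derivation below assumes that both scores are computed from the same smoothed ratio, written here as x(i) (a symbol introduced only for this check), with the same thresholds T_upper = 18 and T_lower = 8.

```latex
% x(i) denotes the common smoothed ratio LFV(i)/HFV(i); the symbol x is introduced here.
\begin{align*}
x(i) > T_{upper}: &\quad 1 - VWLS(i) = 1 - 0 = 1 = VCS(i) \\
x(i) < T_{lower}: &\quad 1 - VWLS(i) = 1 - 1 = 0 = VCS(i) \\
T_{lower} \le x(i) \le T_{upper}: &\quad
  1 - \frac{T_{upper} - x(i)}{T_{upper} - T_{lower}}
  = \frac{x(i) - T_{lower}}{T_{upper} - T_{lower}} = VCS(i)
\end{align*}
```

For instance, x(i) = 13 gives VWLS(i) = (18 - 13)/10 = 0.5 and VCS(i) = (13 - 8)/10 = 0.5, so the two scores sum to one.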

Figure 2-8 shows the voiced consonant detection results for the word "said"
spoken by a male speaker. The two thresholds are shown as dashed lines in Figure 2-8b.
The voice bar that occurs before the release of the /d/ is clearly classified as a voiced
consonant. However, the algorithm assigns a score of slightly greater than 0.5 to the initial
portion of the release of the /d/. This shows the difficulty of detecting voiced stops using
a single acoustic feature.

Figure 2-8. Voiced consonant detection for the word "said."

a) Time-domain waveform;

b) Voiced consonant ratio and thresholds;

c) Voiced consonant score.

2.4.4 Voice Bar Detection

Voice bar detection is accomplished in a manner similar to both vowel and voiced
consonant detection. An LFV function from Equation 2.1 is calculated with G = 1,
A = 1, and B = 33. The LFV is equivalent to a bandpass filter from 20 Hz to 645 Hz.
An HFV function from Equation 2.1 is calculated for G = 1, A = 34, and B = 255. The
HFV is equivalent to a bandpass filter from 664 Hz to 5000 Hz. A voice bar ratio, VB(i),
is calculated for each frame as

$$VB(i) = \frac{LFV(i)}{HFV(i)} \tag{2.16}$$

where i is the frame index. The voice bar ratio is then smoothed with a fifth-order median
filter. A voice bar score, VBS(i), within the continuous range [0,1] is calculated for each
frame by comparing the smoothed voice bar ratio with two thresholds. The score is given
by

$$VBS(i) = \begin{cases} 1, & \text{if } VB(i) > T_{upper} \\ 0, & \text{if } VB(i) < T_{lower} \\ \dfrac{VB(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le VB(i) \le T_{upper} \end{cases} \tag{2.17}$$

where T_upper = 30 and T_lower = 10. The thresholds are determined empirically.

In  a  final  processing  stage,  the  voice  bar  score  is  automatically  set  to  zero  for  all 
frames  in  any  voice  bar  segment  that  is  300  samples  (30  ms)  or  less  in  length.  This  helps 
to  reduce  false  voice  bar  detection. 

Figures  2-9  and  2-10  show  voice  bar  detection  for  the  words  "said"  and  "bond," 
respectively,  spoken  by  a  male  speaker.  The  two  thresholds  are  shown  as  dashed  lines  in 
Figures 2-9b and 2-10b. In Figure 2-9, the voice bar before the release of the /d/ is clearly
detected.  Figure  2-10  shows  both  the  voice  bar  of  the  initial  /b/,  and  the  voice  bar 
associated  with  the  final  /d/,  for  the  word  "bond."  Also  note  in  Figures  2-9  and  2-10  that 
the  voice  bar  associated  with  the  final  /d/  is  much  shorter  in  the  word  "bond"  than  in  the 
word  "said"  due  to  the  preceding  nasal  in  "bond." 

2.4.5  Formant  Tracking 

Formant  tracking  is  accomplished  in  a  manner  completely  different  from  the 
typical  detection  process.  The  output  of  the  algorithm  is  also  different  from  the  other 
feature  detector  outputs.  Actually,  the  formant  tracking  algorithm  is  a  "front-end" 
processor  for  the  nasal  detection  algorithm,  since  the  nasal  detector  does  not  use  volume 
functions,  but  rather  a  ratio  of  the  amplitudes  of  the  first  two  formant  frequencies  to 
determine  the  nasal  feature  score. 

Formant  tracking  is  accomplished  by  an  algorithm  developed  by  McCandless 
(1974).  Only  a  brief  description  is  given,  since  the  algorithm  is  documented  elsewhere. 
Only  the  voiced  portions  of  the  speech  signal  are  analyzed  to  estimate  the  formant  tracks. 

Figure 2-9. Voice bar detection for the word "said."

a) Time-domain waveform;

b) Voice bar ratio and thresholds;

c) Voice bar score.

Figure 2-10. Voice bar detection for the word "bond."

a) Time-domain waveform;

b) Voice bar ratio and thresholds;

c) Voice bar score.

The algorithm attempts to find a best match between the peaks of the frequency
response  obtained  from  the  filter  produced  from  the  LPC  coefficients,  and  estimates  for  the 
first four formant frequencies. The amplitudes of the first four formant peaks are also
estimated. Initially, the estimates for the four formant frequencies are set to values that are
typical for the male voice (or the female voice, if female speech is being analyzed). For
the male voice, the initial estimates are f1 = 320 Hz, f2 = 1440 Hz, f3 = 2760 Hz, and
f4 = 3200 Hz. For the female voice, the initial estimates are f1 = 480 Hz,
f2 = 1760 Hz, f3 = 3200 Hz, and f4 = 3520 Hz. The algorithm matches each peak of
the  frequency  response  of  the  LPC  filter  with  the  closest  formant  frequency  estimate.  The 
estimates  for  the  formant  frequencies  are  updated  after  each  frame  of  speech  is  processed, 
provided  that  a  match  has  been  made. 

In  any  given  frame,  if  there  is  no  match  between  the  LPC  filter  peaks  and  the 
formant  frequency  estimates,  the  algorithm  attempts  to  increase  spectral  resolution  by 
iteratively  evaluating  the  "frequency  response"  of  the  LPC  filter  on  a  circle  in  the  z-plane 
with  a  radius  of  less  than  one.  This  is  done  by  evaluating  the  z-transform  of  the  LPC  filter 
with z = r e^{jω}, where r denotes the radius of the circle. The initial radius value is unity,
and  is  decreased  by  0.004  during  each  iteration  until  a  match  is  obtained,  or  until  the  radius 
is  less  than  0.88.  This  procedure  is  able  to  resolve  two  closely  spaced  poles  if  they  are 
relatively  close  to  the  unit  circle.  If  the  radius  is  reduced  to  less  than  0.88  and  a  match  is 
still  not  found  for  all  of  the  formant  frequencies  in  the  frame,  the  algorithm  reevaluates  the 
matches  it  has  made  for  the  frame.  During  the  reevaluation,  the  algorithm  either  changes 
the  matches  it  has  made,  or  assigns  a  zero  value  to  the  formant  frequency  in  question  for 
that  particular  frame. 
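
The radius-shrinking step can be illustrated in isolation. The sketch below is not the McCandless algorithm: it replaces the peak-to-formant matching with a simple peak count, assumes the LPC polynomial is given as real coefficients [1, a1, ..., ap], and evaluates 1/|A(z)| at 256 points on a circle of radius r. All function names are hypothetical.

```python
import cmath
import math


def poly_from_roots(roots):
    """Real coefficients [1, a1, ..., ap] of A(z) = prod(1 - p_k z^-1).
    The roots are expected to come in complex-conjugate pairs."""
    coeffs = [1.0 + 0j]
    for p in roots:
        nxt = coeffs + [0j]
        for k in range(1, len(nxt)):
            nxt[k] -= p * coeffs[k - 1]
        coeffs = nxt
    return [c.real for c in coeffs]


def magnitude_on_circle(a, r, n_points=256):
    """|1 / A(z)| at z = r * exp(j * pi * m / n_points), m = 0 .. n_points - 1."""
    mags = []
    for m in range(n_points):
        z = cmath.rect(r, math.pi * m / n_points)
        A = sum(c * z ** (-k) for k, c in enumerate(a))
        mags.append(1.0 / abs(A))
    return mags


def peak_bins(mags):
    """Indices of strict interior local maxima."""
    return [m for m in range(1, len(mags) - 1) if mags[m - 1] < mags[m] > mags[m + 1]]


def resolve_peaks(a, wanted, r_min=0.88, step=0.004):
    """Shrink the radius from 1.0 in steps until enough peaks appear
    or the radius drops below r_min."""
    k = 0
    r = 1.0
    peaks = peak_bins(magnitude_on_circle(a, r))
    while len(peaks) < wanted and r >= r_min:
        k += 1
        r = 1.0 - step * k
        peaks = peak_bins(magnitude_on_circle(a, r))
    return r, peaks
```

As a usage example, two pole pairs of radius 0.9 placed 0.12 rad apart (a hypothetical test case) lie close enough that their peaks tend to merge on the unit circle. Calling resolve_peaks(a, 2) on the corresponding coefficients returns a radius between 0.88 and 1.0 at which two separate peaks are found.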

The  final  results  are  smoothed  by  checking  each  of  the  first  three  formant  frequency 
tracks  individually  for  any  zero  values  (this  is  not  a  part  of  the  McCandless  algorithm). 
If  the  formant  frequency  is  zero  for  either  one  or  two  (consecutive)  frames,  the  frequency 
and  amplitude  are  linearly  interpolated  and  the  zero  values  are  removed. 
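
The gap-filling step can be sketched as follows. This is an illustrative version that operates on a single track. It assumes a gap is a run of exact zeros, that only gaps with valid values on both sides are filled, and that the amplitude track would be interpolated over the same frames; the function name is hypothetical.

```python
def fill_short_gaps(track, max_gap=2):
    """Linearly interpolate runs of one or two zero-valued frames."""
    out = list(track)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != 0.0:
            i += 1
            continue
        j = i
        while j < n and out[j] == 0.0:    # find the end of the zero run
            j += 1
        gap = j - i
        if gap <= max_gap and i > 0 and j < n:
            left, right = out[i - 1], out[j]
            for k in range(gap):
                out[i + k] = left + (right - left) * (k + 1) / (gap + 1)
        i = j
    return out
```

For a track such as 300, 0, 320, 0, 0, 350, 0, 0, 0, 400 (in Hz), the single-frame and two-frame gaps are filled with 310 and with 330, 340 respectively, while the three-frame gap is left at zero.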
The first three estimated formant frequency and estimated formant amplitude
tracks  for  the  voiced  portion  of  the  word  "meat"  spoken  by  a  male  speaker  are  shown  in 
Figure  2-11.  Although  only  the  first  two  formant  tracks  are  used  in  this  study,  the  first  three 
formants  were  retained  for  future  work. 


2.4.6  Nasal  Detection 

Nasal  detection  is  done  by  comparing  the  estimated  amplitudes  of  the  first  two 
formants  obtained  in  the  McCandless  formant  tracker.  A  nasal  ratio,  N(i),  is  calculated  for 
each  frame  as 


$$N(i) = \frac{A_2(i)}{A_1(i)} \tag{2.18}$$

where A1(i) is the estimated first formant amplitude, A2(i) is the estimated second formant
amplitude, and i is the current frame index (Mermelstein, 1977). The nasal ratio is then
smoothed by a fifth-order median filter. A nasal score, NS(i), within the continuous range
[0,1] is calculated for each frame by comparing the smoothed nasal ratio with two
thresholds. The score is given by


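$$NS(i) = \begin{cases} 0, & \text{if } N(i) > T_{upper} \\ 1, & \text{if } N(i) < T_{lower} \\ \dfrac{T_{upper} - N(i)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le N(i) \le T_{upper} \end{cases} \tag{2.19}$$

where T_upper = 0.20 and T_lower = 0.05. The thresholds are determined empirically.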

Figure 2-11. McCandless formant tracker for the word "meat."

a) Time-domain waveform;

b) Amplitudes of the first three formants;

c) Center frequencies of the first three formants.

Additional processing is done by the application of pattern recognition rules to
distinguish nasals from other segment types. First, if a frame has a voice bar score greater
than 0.75, the nasal score for that frame is set to zero. This is because strong voice bars
typically exhibit strong nasal scores. The opposite, however, is not true. The nasal score
is then set to zero if the frame has a zero sonorant score. This is done to prevent
non-sonorant frames from being classified as nasal. Finally, the nasal scores for all frames
in any continuous nasal segment that is less than 25 ms in length are set to zero.
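
The per-frame part of the nasal detector can be sketched as below. The median smoothing of the nasal ratio and the 25 ms duration rule are omitted for brevity, and the function and argument names are hypothetical.

```python
def nasal_score(a1, a2, vbs, ss, t_lower=0.05, t_upper=0.20):
    """Nasal score for one frame.

    a1, a2: estimated first and second formant amplitudes
    vbs:    voice bar score for the frame
    ss:     binary sonorant score for the frame
    """
    n = a2 / a1                                   # nasal ratio, Equation 2.18
    if n > t_upper:
        ns = 0.0
    elif n < t_lower:
        ns = 1.0
    else:
        ns = (t_upper - n) / (t_upper - t_lower)  # Equation 2.19
    if vbs > 0.75:                                # strong voice bar: not a nasal
        ns = 0.0
    if ss == 0:                                   # non-sonorant frame: not a nasal
        ns = 0.0
    return ns
```

A frame whose second formant is much weaker than its first (for example A2/A1 = 0.01) receives a nasal score of one unless the voice bar or sonorant rule overrides it.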

Figure 2-12 shows nasal detection for the word "bond" spoken by a male speaker.
The two thresholds are shown as dashed lines in Figure 2-12b. Note that the nasal score
rises slowly after the transition from the vowel to the nasal. This shows that the
formant-amplitude-based algorithm is not as accurate as the other feature detection
algorithms that are based upon the short-term frequency response. Still, the algorithm is
able to correctly identify the nasal region.

Figure 2-12. Nasal detection for the word "bond."

a) Time-domain waveform;

b) Nasal ratio and thresholds;

c) Nasal score.

2.4.7 Semivowel Detection

Semivowel detection is inspired by a method developed by Espy-Wilson (1986).
The algorithm deviates slightly from the standard detector, although it uses the volume
functions from Equation 2.1. An LFV function from Equation 2.1 is calculated with G = 1,
A = 1, and B = 20. The LFV is equivalent to a bandpass filter from 20 Hz to 391 Hz.
An HFV function from Equation 2.1 is calculated for G = 1, A = 21, and B = 50. The
HFV is equivalent to a bandpass filter from 410 Hz to 977 Hz. A murmur ratio, MUR(i),
is calculated for each frame as

$$MUR(i) = \frac{LFV(i)}{HFV(i)} \tag{2.20}$$

The murmur ratio is then smoothed with a fifth-order median filter. A murmur score,
MS(i), within the continuous range [0,1] is calculated for each frame by comparing the
smoothed murmur ratio with two thresholds. The score is given by

$$MS(i) = \begin{cases} 1, & \text{if } MUR(i) > T_{upper} \\ 0, & \text{if } MUR(i) < T_{lower} \\ \dfrac{MUR(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le MUR(i) \le T_{upper} \end{cases} \tag{2.21}$$

where T_upper = 12 and T_lower = 4. The thresholds are determined empirically. The
semivowel score, SVS(i), is then calculated for each frame as


$$SVS(i) = [1 - MS(i)]\,[1 - VBS(i)]\,VCS(i) \tag{2.22}$$


where  i  is  the  frame  index,  VBS(i)  is  the  voice  bar  score  from  Section  2.4.4,  and  VCS(i) 
is the voiced consonant score from Section 2.4.3. The value of SVS is limited to the range
[0,1].  If  SVS  is  greater  than  one,  it  is  set  to  unity  for  the  frame.  Equation  2.22  shows  that 
if  a  frame  has  a  high  voiced  consonant  score,  a  low  murmur  score,  and  a  low  voice  bar  score, 
it  will  have  a  high  semivowel  score. 

Additional  processing  is  done  to  smooth  SVS.  If  the  frame  has  a  nasal  score  greater 
than  0.5,  SVS  is  set  to  zero  for  that  frame.  This  is  because  some  strong  nasals  get  labeled 
as  semivowels.  In  addition,  the  semivowel  scores  for  all  of  the  frames  in  any  continuous 
semivowel  segment  less  than  30  ms  in  duration  are  set  to  zero.  This  eliminates  mislabeling 
of  short  segments  that  are  not  semivowels. 
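
A per-frame sketch of the semivowel score is given below. It combines Equation 2.22 with the nasal veto described above; the 30 ms duration rule is left out, and the function name is hypothetical.

```python
def semivowel_score(ms, vbs, vcs, ns):
    """Semivowel score for one frame.

    ms:  murmur score, vbs: voice bar score,
    vcs: voiced consonant score, ns: nasal score
    """
    if ns > 0.5:                      # strong nasal: do not label as semivowel
        return 0.0
    svs = (1.0 - ms) * (1.0 - vbs) * vcs
    return min(max(svs, 0.0), 1.0)    # keep the score inside [0, 1]
```

A frame with a high voiced consonant score and low murmur and voice bar scores scores close to one, while any of the three factors approaching its unfavorable extreme pulls the product toward zero.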

Figure  2-13  shows  semivowel  detection  for  the  word  "wield"  spoken  by  a  male 
speaker. The algorithm correctly detects the leading /w/ as well as the /l/ near the end of
the word immediately preceding the release of the plosive /d/. Note that listening reveals
that the actual release in this example more closely resembles an unvoiced /t/ rather than
a voiced /d/. Also note that the post-processing does not detect any nasal segments nor any
semivowel  segments  less  than  30  ms  long.  Therefore,  the  results  for  both  before  and  after 
post-processing  are  the  same. 

Figure 2-13. Semivowel detection for the word "wield."

a) Time-domain waveform;

b) Semivowel score before post-processing;

c) Semivowel score after post-processing.

2.4.8 Voiced Fricative Detection

The voiced fricative detection algorithm deviates from the standard detector,
although  it  does  calculate  feature  scores  from  fixed  thresholds.  The  first  step  in  voiced 
fricative  detection  is  to  add  preemphasis  to  the  frequency  response  of  the  filter  produced 
by  the  LPC  coefficients.  In  other  studies,  the  typical  preemphasis  method  is  to  calculate 
the  first  difference  of  the  sampled  data  waveform  before  calculating  the  LPC  coefficients. 
In  this  study  the  first  difference  is  not  calculated  before  the  LPC  analysis,  so  the  first 
difference  function  is  approximated  by  a  weighting  function,  W,  in  the  frequency  domain 
given  by 

WCeJ'^)  =  ^,  0  <  m  <  255  (2.23) 

The magnitude of the weighting function's frequency response is within 3 dB of the
magnitude of the frequency response of a first-order differentiator for all frequencies in the
filter passband. The preemphasized frequency response for frame i, \tilde{H}_i, is

$$\tilde{H}_i(e^{j\omega_m}) = W(e^{j\omega_m})\, H_i(e^{j\omega_m}), \quad 0 \le m \le 255 \tag{2.24}$$

where H_i is calculated from Equation 2.2 for frame i with G = 1. The mean frequency
of the preemphasized frequency response, MF(i), is then found for each frame as

$$MF(i) = \frac{1}{H_{total}(i)} \sum_{m=0}^{255} \left( \frac{m f_s}{512} \left| \tilde{H}_i(e^{j\omega_m}) \right| \right) \tag{2.25}$$

where f_s = 10 kHz, and i is the frame index. H_total(i) is given for frame i as

$$H_{total}(i) = \sum_{m=0}^{255} \left| \tilde{H}_i(e^{j\omega_m}) \right| \tag{2.26}$$

MF(i) is then smoothed by a third-order median filter. A high frequency score, HFS(i), is
calculated for each frame as


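$$HFS(i) = \begin{cases} 1, & \text{if } MF(i) > T_{upper} \\ 0, & \text{if } MF(i) < T_{lower} \\ \dfrac{MF(i) - T_{lower}}{T_{upper} - T_{lower}}, & \text{if } T_{lower} \le MF(i) \le T_{upper} \end{cases} \tag{2.27}$$

where T_upper = 3200, and T_lower = 2400. The thresholds are determined empirically.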

The voiced fricative score, VFS(i), is then calculated using HFS(i). If the frame
is voiced and the sonorant score is zero, then VFS(i) = 1 for the frame. This is done
because the frame is voiced and also has a relatively large amount of high frequency energy
(i.e. the frame is non-sonorant). If the frame is voiced and the sonorant score is 1, then
VFS(i) = HFS(i) for the frame. In this case, the voiced fricative score depends solely
upon the frame's high frequency energy distribution. The final step is to set VFS(i) to zero
for all frames in any voiced fricative segment less than 15 ms in duration. This is done to
eliminate false detection of short segments.
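
The voiced fricative logic can be sketched per frame as follows. The sketch assumes that the mean frequency is the magnitude-weighted average of the point frequencies m * fs / 512 (one reading of Equation 2.25), that the preemphasized magnitudes are supplied directly, and that unvoiced frames receive a score of zero. Median smoothing and the 15 ms duration rule are omitted, and all names are hypothetical.

```python
FS = 10_000.0   # sampling rate in Hz, from the text
N_FFT = 512     # assumed: 256 single-sided points spaced FS / 512 apart


def mean_frequency(mags):
    """Magnitude-weighted mean frequency in Hz of a 256-point spectrum (assumed form)."""
    total = sum(mags)
    return sum(m * FS / N_FFT * h for m, h in enumerate(mags)) / total


def high_freq_score(mf, t_lower=2400.0, t_upper=3200.0):
    """High frequency score, following Equation 2.27."""
    if mf > t_upper:
        return 1.0
    if mf < t_lower:
        return 0.0
    return (mf - t_lower) / (t_upper - t_lower)


def voiced_fricative_score(voiced, sonorant, hfs):
    """Voiced fricative score for one frame."""
    if not voiced:
        return 0.0          # assumption: only voiced frames can score
    if sonorant == 0:
        return 1.0          # voiced and non-sonorant
    return hfs              # voiced and sonorant: depends on high frequency content
```

For a spectrum concentrated at point 128, the mean frequency is 2500 Hz, which gives a high frequency score of (2500 - 2400) / 800 = 0.125.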

Figure 2-14 shows the results of voiced fricative detection for the word "zoo"
spoken by a male speaker. Examination of the spectrogram (not shown) reveals that there
is little high frequency energy at the beginning of the /z/. This explains why the algorithm
does not classify the beginning of the /z/ as a voiced fricative.

Figure 2-14. Voiced fricative detection for the word "zoo."

a) Time-domain waveform;

b) High frequency score;

c) Voiced fricative score.

2.4.9 Unvoiced Stop and Fricative Detection

If a frame is classified as unvoiced, it is either an unvoiced stop or an unvoiced
fricative. The algorithm used to distinguish between the two segment types differs from
the standard feature detector, and uses both time-based and frequency-based parameters.
First, the mean frequency is calculated for each frame from Equations 2.23 through 2.26.
The mean frequency track is smoothed by a third-order median filter. The high frequency
score is then calculated for each frame from Equation 2.27 with T_upper = 3800 and
T_lower = 2400. The base-ten logarithm of the power, P_log10, is calculated from the initial
LPC analysis results of each frame. Next, all adjacent unvoiced frames are grouped into
segments. For example, the word "sit" has two unvoiced segments, /s/ and /t/, and each
of these unvoiced segments is composed of multiple, adjacent, unvoiced frames.

The slope of P_log10 of the initial twelve frames (60 ms) of each unvoiced segment
is examined. If the segment is shorter than twelve frames, all of the frames are used. A
first-order approximation (i.e. a straight line), M_seg(j), of the slope of P_log10 is calculated
for each segment j using the MATLAB function "polyfit." This is a least-squares fit. A
segment slope score, MS_seg(j), is computed for each segment from M_seg(j) by


$$MS_{seg}(j) = \begin{cases} 0, & \text{if } M_{seg}(j) \ge T_{upper} \\ 1, & \text{if } M_{seg}(j) \le T_{lower} \\ \dfrac{T_{upper} - M_{seg}(j)}{T_{upper} - T_{lower}}, & \text{if } T_{lower} < M_{seg}(j) < T_{upper} \end{cases} \tag{2.28}$$

where T_upper = 1.0, T_lower = -1.0, and j is the index of the current segment. The "seg"
subscript is included to draw attention to the fact that the slope score is calculated as a single
value for the entire unvoiced segment. All of the frames in a given unvoiced segment are
assigned the same MS_seg value. The frame slope score is denoted as MS(i). Thus,
MS(i) = MS_seg(j) for each frame i in segment j.
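
The slope score can be sketched without MATLAB by computing the least-squares slope directly. This sketch assumes the slope is measured per frame index (the units used with "polyfit" are not stated in the text), so the thresholds of 1.0 and -1.0 are interpreted on that scale. The function names are hypothetical.

```python
def ls_slope(y):
    """Least-squares slope of y against the frame index 0, 1, 2, ..."""
    n = len(y)
    if n < 2:
        return 0.0                      # a single frame has no slope
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def slope_score(m_seg, t_lower=-1.0, t_upper=1.0):
    """Segment slope score, following Equation 2.28."""
    if m_seg >= t_upper:
        return 0.0
    if m_seg <= t_lower:
        return 1.0
    return (t_upper - m_seg) / (t_upper - t_lower)


def segment_slope_score(p_log10, max_frames=12):
    """Score an unvoiced segment from its first twelve frames (or fewer)."""
    return slope_score(ls_slope(p_log10[:max_frames]))
```

A segment whose log power falls by one unit per frame scores 1.0, a flat segment scores 0.5, and a segment whose log power rises by one unit or more per frame scores 0.0. Every frame in the segment would then inherit this single value as MS(i).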

Calculation  of  the  unvoiced  stop  score,  USS(i),  takes  advantage  of  the  fact  that 
unvoiced  stops  are  inherently  shorter  in  duration  than  unvoiced  fricatives  (Cole  and 
Cooper,  1975;  Klatt,  1979;  Umeda,  1977).  The  unvoiced  stop  score,  USS(i),  is  given  for 
each  frame  by 


$$USS(i) = K_s \, MS(i) \tag{2.29}$$

where i is the current frame index, and K_s is given as
K_s =


GJ  1  + 


( 


T        —  T 

^  stop        ^j 

T 

^stop 


1  + 


-1 


^j  "^  Tjiop 


Tsu>p  ^  Lj  <  Tj^ 


Lj  >  Tine 


(2.30) 


where  Gs  =  8.0,  Tstop  =  50  ms,  T^^  =  80  ms,  and  Lj  denotes  the  length  of  the 
unvoiced  segment  j  (in  milliseconds).  The  two  thresholds  and  the  gain,  Gs,  are  all 
determined empirically. The term K_s acts as a duration-dependent scale factor that greatly
amplifies the stop score for unvoiced segments less than 50 ms long. To a lesser degree,
K_s also attenuates the stop score for unvoiced segments greater than 80 ms long. The
unvoiced stop score, USS(i), is then limited to the range [0,1]. If USS(i) is greater than
one for a given frame, it is set to unity for the frame.

The  final  unvoiced  fricative  score,  UFS(i),  is  calculated  for  each  frame  by 


UFS(i)  =  HFS(i) 


(2.31) 


which  is  simply  the  high  frequency  score  for  the  frame. 

Figure  2-15  shows  both  the  unvoiced  stop  and  the  unvoiced  fricative  scores  for  the 
word  "pest"  spoken  by  a  male  speaker.  The  /p/  is  correctly  detected  as  an  unvoiced  stop, 
the  /s/  is  correctly  detected  as  an  unvoiced  fricative,  and  the  /t/  has  both  a  high  unvoiced 
stop  score  and  a  high  unvoiced  fricative  score.  This  is  because  of  the  high  mean  frequency 
of the /t/. Note in Figure 2-15b that the /t/ is incorrectly split into two different segments,
which is an error caused by the V/U/S algorithm. Still, despite the V/U/S error, the /t/ has
a  greater  unvoiced  stop  score  than  unvoiced  fricative  score,  which  is  desirable. 

Figure 2-15. Unvoiced fricative and stop detection for the word "pest."

a) Time-domain waveform;

b) Unvoiced stop score;

c) Unvoiced fricative score.

2.5 Speech Segmentation

The  algorithms  described  in  the  previous  section  focus  primarily  upon  the  acoustic 
features  associated  with  individual  frames.  However,  in  order  to  segment  and  label  the 
speech  into  the  segment  categories  defined  in  Section  2.1,  the  boundaries  between  the 
phoneme  segments  must  be  determined.  This  is  done  with  two  algorithms  that  are 
described  in  the  following  subsections.  The  results  from  the  two  algorithms  are  then 
combined  to  determine  the  final  segment  boundaries  and  durations. 

2.5.1   Spectral-Based  Boundary  Detection  and  Segmentation 

The first segmentation algorithm is based upon changes in the short-term frequency
spectra of the speech signal. It uses an algorithm developed by Glass and Zue (1986) that
measures the similarity between a current frame and its neighbors. To do so, the absolute
value of the frequency response of the filter produced by the LPC coefficients from
Equations 2.2 and 2.3 is calculated for each frame. A Euclidean distance measure, D(x,y),
is  defined  as 


$$ D(x,y) = \sum_{m=0}^{255} \Big|\, \big|H_x(e^{j\pi m/256})\big| - \big|H_y(e^{j\pi m/256})\big| \,\Big| \tag{2.32} $$


where x is the current frame index, y is a past or future frame index, and Hx(e^{jπm/256}) is the
single-sided frequency response evaluated for frame x at the points exp(jπm/256) for
0 ≤ m ≤ 255. From Glass and Zue (1986), the decision strategy is to associate the current
frame x with past frames if

$$ \max\big( D(x,y) \big) < \min\big( D(x,v) \big), \qquad x-4 \le y \le x-2, \quad x+2 \le v \le x+4 \tag{2.33} $$

and to associate the current frame x with future frames if

$$ \min\big( D(x,y) \big) > \max\big( D(x,v) \big), \qquad x-4 \le y \le x-2, \quad x+2 \le v \le x+4 \tag{2.34} $$

No association (a "don't care" state) is made if neither of these conditions is met. After
each frame is associated with one of the three states, a segment boundary is determined to
occur whenever the current frame's association changes from the past to the future. The
location of the boundary is at the first sample of the frame where the transition to the future
occurs. Post-processing is also done to remove any boundaries that occur in the middle of
silent segments.
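The decision strategy of Equations 2.32 through 2.34 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, the magnitude spectra are assumed to be already computed (one list per frame), frames without a full set of three past and three future neighbors are assumed to receive no association, and "don't care" frames between a past frame and a future frame are assumed to be skipped when looking for a past-to-future change.

```python
def spectral_distance(hx, hy):
    """Equation 2.32: sum of absolute differences of two magnitude spectra."""
    return sum(abs(a - b) for a, b in zip(hx, hy))


def associate_frames(spectra):
    """Label each frame 'past', 'future', or 'none' (Equations 2.33 and 2.34)."""
    n = len(spectra)
    labels = []
    for x in range(n):
        past = [y for y in range(x - 4, x - 1) if y >= 0]    # x-4 .. x-2
        future = [v for v in range(x + 2, x + 5) if v < n]   # x+2 .. x+4
        if len(past) < 3 or len(future) < 3:
            labels.append("none")  # assumption: edge frames are "don't care"
            continue
        d_past = [spectral_distance(spectra[x], spectra[y]) for y in past]
        d_future = [spectral_distance(spectra[x], spectra[v]) for v in future]
        if max(d_past) < min(d_future):
            labels.append("past")
        elif min(d_past) > max(d_future):
            labels.append("future")
        else:
            labels.append("none")
    return labels


def find_boundaries(labels):
    """Frame indices where the association changes from past to future."""
    boundaries = []
    last = None  # most recent association that was not "don't care"
    for x, label in enumerate(labels):
        if label == "none":
            continue  # assumption: "don't care" frames are skipped
        if label == "future" and last == "past":
            boundaries.append(x)
        last = label
    return boundaries


# Toy example: ten frames of one flat spectrum followed by ten of another.
toy_spectra = [[1.0] * 4] * 10 + [[5.0] * 4] * 10
toy_labels = associate_frames(toy_spectra)
print(find_boundaries(toy_labels))
```

The returned values are frame indices; converting a frame index to the first sample of that frame, and removing boundaries inside silent segments, would be separate steps.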

Figure 2-16 shows the spectral-based boundary detection results for the word
"wield" spoken by a male speaker. The algorithm marks boundaries at the beginning of
the /w/, at the beginning of the relatively stationary portion of the vowel /i/, at the end of
the transition from the vowel to the liquid /l/, and at the beginning of the release of the /d/.
While all of these points are clearly seen from Figure 2-16a as dividing lines between
different parts of the word, the locations may not always agree with the results obtained
by manual parsing. For example, in the transition from the /i/ to the /l/, manual parsing
might put the location of the transition at the middle of the second formant transition
region, instead of at the end of the transition. However, this lack of agreement between
automatic and manual segmentation results can often be attributed to the lack of a
universally accepted method to manually specify the "correct" transition point between
two phonemes.

Figure 2-16.    Spectral-based boundary detection for the word "wield."

a)  Spectrogram;

b)  Frame association;

c)  Spectral boundaries.

2.5.2  V/U/S Boundary Detection

The second segmentation algorithm is based upon the voiced / unvoiced / silent
(V/U/S) feature detection algorithm results. The raw V/U/S results are processed using the
pattern recognition rules listed in Table 2-1. Boundaries are determined to occur wherever
transitions in the V/U/S track occur. The boundary is marked at the first sample of the first
frame of the new segment.

Figure 2-17 shows the V/U/S boundary detection results for the word "wield"
spoken by a male speaker. Note that as discussed in Section 2.4.7, the release of the /d/ more
closely resembles a /t/, and is classified as unvoiced.

2.5.3  Final  Segmentation 

Results  from  both  the  spectral  segmentation  and  the  V/U/S  segmentation 
algorithms  are  used  in  the  final  segmentation  process.  All  boundaries  from  the  V/U/S 
algorithm  are  marked  as  boundaries  in  the  final  result.  Any  boundary  from  the 
spectral-based boundary detection algorithm that occurs in the middle of a voiced segment
(as determined from the V/U/S results) is also marked as a boundary in the final result,
provided that the boundary occurs at a frame that is located greater than two frames away
from any V-U, U-V, V-S, or S-V boundary. This "two-frame rule" keeps the two
algorithms  from  marking  the  same  phoneme  boundary  as  two  separate,  but  closely  spaced 
boundaries. 
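The combination step, including the two-frame rule, can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name, the per-frame list of 'V', 'U', and 'S' labels, and the representation of boundaries as frame indices are all assumptions made for the example.

```python
def final_boundaries(vus, spectral, min_gap=2):
    """Combine V/U/S boundaries with spectral boundaries.

    vus      -- per-frame labels, each 'V', 'U', or 'S' (assumed format)
    spectral -- frame indices of spectral boundaries (assumed format)
    min_gap  -- the "two-frame rule": a spectral boundary must lie more
                than this many frames from any boundary of a voiced segment
    """
    # Every transition in the V/U/S track is a boundary in the final result.
    vus_bounds = [i for i in range(1, len(vus)) if vus[i] != vus[i - 1]]
    # V-U, U-V, V-S, and S-V boundaries all involve a voiced frame.
    voiced_edges = [i for i in vus_bounds if "V" in (vus[i - 1], vus[i])]
    kept = []
    for b in spectral:
        if vus[b] != "V":
            continue  # spectral boundaries outside voiced segments are ignored
        if all(abs(b - e) > min_gap for e in voiced_edges):
            kept.append(b)
    return sorted(set(vus_bounds) | set(kept))


# Toy example: 5 silent frames, 20 voiced frames, 5 unvoiced frames.
track = ["S"] * 5 + ["V"] * 20 + ["U"] * 5
print(final_boundaries(track, [6, 12, 24, 27]))
```

In the toy example, the spectral boundaries at frames 6 and 24 fall within two frames of a V/U/S boundary and are discarded, and the one at frame 27 lies in an unvoiced segment, so only the spectral boundary at frame 12 is added to the V/U/S boundaries.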

Figure 2-18 shows the final segmentation results for the word "wield" as spoken
by a male speaker. Note that the boundaries in both Figure 2-18b and 2-18c are added
together to create the final segmentation results that are shown in a later figure. The two
spectral boundaries from Figure 2-16c at (approximately) sample numbers 3700 and
10,100 have been discarded in Figure 2-18c since they coincide with the V/U/S boundaries
and are eliminated by the two-frame rule.

The inclusion of the spectral segment boundaries creates a type of
"sub-segmentation" that further divides voiced segments into regions of smaller durations.
Examples of this are the semivowel-vowel and the vowel-semivowel transitions shown
in Figure 2-18. Here, the boundaries between the /w/ and the /i/, and the /i/ and the /l/,
divide the voiced region into three distinct parts.

Figure 2-17.    V/U/S boundary detection for the word "wield."

a)  Time-domain waveform;

b)  V/U/S classification;

c)  V/U/S boundaries.

Figure 2-18.    Final boundary detection for the word "wield."

a)  Time-domain waveform;

b)  V/U/S boundaries;

c)  Spectral boundaries.

Spectral boundaries that occur in the middle of unvoiced segments are ignored.
This  is  done  to  lessen  mistakes  in  subsequent  labeling,  since  the  specific  token  words  that 
are  analyzed  in  this  study  do  not  contain  double  consonant  patterns.  Note,  however,  that 
double  consonant  patterns  regularly  exist  in  the  English  language.  Therefore,  in  future 
work,  the  spectral  boundaries  that  occur  in  unvoiced  segments  should  be  included  in  the 
final  segmentation  results. 

2.6  Segment  Labeling 

Segment  labeling  is  defined  as  the  task  of  correctly  assigning  a  label  from  one  of 
the  eight  speech  segment  categories  of  Section  2.1  to  each  (unknown  type)  segment 
produced  by  the  final  segmentation  algorithm  of  Section  2.5.3. 

The labeling algorithm first examines the V/U/S results for each segment. If the
segment is voiced, it can be labeled as either a vowel, semivowel, nasal, voice bar, or voiced
fricative. If the segment is unvoiced, it can be labeled as either an unvoiced stop or an
unvoiced fricative. If the segment is silent, it can only be labeled as silent. For each
segment, each of the possible feature scores is averaged over the duration of the segment.
For example, if the unknown segment is unvoiced, then USS and UFS will both be
averaged across the frames that comprise the segment. Since the segment in this example
is unvoiced, the average scores for VWLS, NS, SVS, VBS and VFS need not be calculated
for the segment. Likewise, if the segment is voiced, the average scores for VWLS, NS,
SVS, VBS and VFS are all calculated, and the averages for USS and UFS are not. The
average unvoiced stop score, USSmean, is given by


$$ \mathrm{USS}_{\mathrm{mean}}(j) = \frac{1}{b - a + 1} \sum_{i=a}^{b} \mathrm{USS}(i) \tag{2.35} $$

where a is the index of the starting frame in segment j, and b is the index of the final frame
in  segment  j.  The  averages  for  all  of  the  other  feature  scores  are  calculated  in  the  same 
manner. 

Once the average scores are calculated for all of the unknown segments, a first
choice label, L1(j), and a second choice label, L2(j), are selected for each segment j. The
first choice label for each segment is the feature with the highest mean score. The second
choice label for each segment is the feature with the second highest mean score. Reliability
scores are also calculated for L1(j) and L2(j). The reliability score, R1(j), for L1(j) is
defined as the mean score for the first choice label divided by the sum of all of the mean
scores for that segment. For example, if a segment is voiced and L1(j) is nasal, then R1(j)
is given by

$$ R_1(j) = \frac{\mathrm{NS}_{\mathrm{mean}}(j)}{\mathrm{VWLS}_{\mathrm{mean}}(j) + \mathrm{SVS}_{\mathrm{mean}}(j) + \mathrm{NS}_{\mathrm{mean}}(j) + \mathrm{VBS}_{\mathrm{mean}}(j) + \mathrm{VFS}_{\mathrm{mean}}(j)} \tag{2.36} $$

Likewise, if the segment is unvoiced, and L1(j) is an unvoiced fricative, then R1(j) is given
by

$$ R_1(j) = \frac{\mathrm{UFS}_{\mathrm{mean}}(j)}{\mathrm{UFS}_{\mathrm{mean}}(j) + \mathrm{USS}_{\mathrm{mean}}(j)} \tag{2.37} $$

The  reliability  score,  R2(j),  for  L2(j)  is  defined  as  the  mean  score  for  the  second  choice 
label  divided  by  the  sum  of  all  the  mean  scores  for  that  segment. 
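The averaging of Equation 2.35 and the label and reliability selection of Equations 2.36 and 2.37 can be sketched as shown here. This is an illustrative sketch only: the function names and the dictionary of per-frame score lists are assumptions, and returning a reliability of zero when all mean scores are zero is a choice made for the example, since the text does not cover that case.

```python
VOICED_FEATURES = ("VWLS", "SVS", "NS", "VBS", "VFS")
UNVOICED_FEATURES = ("USS", "UFS")


def segment_means(scores, a, b, features):
    """Equation 2.35 applied to each feature: mean over frames a..b inclusive.

    scores -- dict mapping a feature name to its list of per-frame scores
    """
    count = b - a + 1
    return {f: sum(scores[f][a:b + 1]) / count for f in features}


def rank_labels(means):
    """Return (L1, R1, L2, R2) for one segment from its mean feature scores."""
    total = sum(means.values())
    ordered = sorted(means, key=means.get, reverse=True)
    first, second = ordered[0], ordered[1]
    if total == 0:
        return first, 0.0, second, 0.0  # assumption: no evidence, no reliability
    return first, means[first] / total, second, means[second] / total


# Toy example: an unvoiced segment spanning frames 0 and 1.
frame_scores = {"USS": [0.2, 0.4, 0.6, 0.9], "UFS": [0.8, 0.6, 0.4, 0.1]}
means = segment_means(frame_scores, 0, 1, UNVOICED_FEATURES)
print(rank_labels(means))
```

Only the features that are possible for the segment's V/U/S class are passed in, which mirrors the statement that the other averages need not be calculated.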

There are two different cases where R1(j) and R2(j) can be used to override L1(j).
In the first case, if for a given segment j, (1) the preceding segment is a vowel, and (2) the
current segment is not a vowel, and (3) the current segment has a mean vowel score,
VWLSmean, greater than 0.45, and (4) the second choice for the segment, L2(j), is a vowel,
and (5) the reliability of the second choice, R2(j), is greater than 0.35, then L1(j) is changed
to a vowel. This is done because the spectral sub-segmentation algorithm of Section 2.5.1
occasionally divides a single vowel into two or more parts. Often this is because the latter
part of the vowel is actually starting a transition to a following non-vowel and is exhibiting
coarticulation effects. The coarticulation effects may be strong enough to cause the final
portion of the vowel to be classified as a non-vowel. Therefore, if the final vowel segment
meets the conditions listed above, it can be assumed that coarticulation is taking place, and
that the segment is actually a vowel.

The second case where R1(j) and R2(j) are used to possibly override L1(j) is as
follows. First, this rule is invoked only if the preceding rule was not invoked for the current
segment. Then, if (1) the current segment is a vowel, and (2) the current segment has a mean
vowel score, VWLSmean, less than 0.50, and (3) the reliability of the second choice, R2(j),
is greater than 0.10, then L1(j) is changed to the second choice, L2(j). This is done because
the average vowel score is less than 0.5, which implies from Section 2.4.3 that the average
voiced consonant score (which is not calculated since it is not a segment category) is greater
than 0.5. This indicates that the segment is actually some type of voiced consonant.
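The two override rules can be written compactly, as in the sketch below. It is a hedged illustration and not the code of this study: the function name and argument names are hypothetical, the labels are assumed to be short strings such as "V" for vowel and "N" for nasal, and the label of the preceding segment is assumed to be its label after any override has been applied.

```python
def apply_overrides(l1, l2, r2, vwls_mean, prev_label):
    """Return the final label for a segment after the two override rules.

    l1, l2     -- first and second choice labels ("V" denotes vowel)
    r2         -- reliability score of the second choice
    vwls_mean  -- mean vowel score of the segment
    prev_label -- label of the preceding segment (assumed post-override)
    """
    # Rule 1: a non-vowel that follows a vowel but still looks vowel-like
    # is treated as the coarticulated tail of that vowel.
    if (prev_label == "V" and l1 != "V" and vwls_mean > 0.45
            and l2 == "V" and r2 > 0.35):
        return "V"
    # Rule 2 (reached only when rule 1 was not invoked): a "vowel" with a
    # low mean vowel score is replaced by the second choice.
    if l1 == "V" and vwls_mean < 0.50 and r2 > 0.10:
        return l2
    return l1


print(apply_overrides("N", "V", 0.40, 0.50, "V"))
print(apply_overrides("V", "SV", 0.20, 0.40, "si"))
```

Because rule 1 returns early, rule 2 is evaluated only for segments where rule 1 did not fire, which matches the ordering described above.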

Figure 2-19 shows the final segmentation and labeling results for the word "wield"
spoken by a male speaker. The symbol "si" denotes silence, the symbol "SV" denotes
semivowel, the symbol "V" denotes vowel, and the symbol "US" denotes unvoiced stop.
The durations of the individual segments are shown as the number of samples in Figure 2-19b.
The starting sample number of each new segment is shown in Figure 2-19c.

2.7  Manual Modification of Automatic Segmentation and Labeling Results

All speech recognition algorithms make mistakes. The number and nature of the
mistakes depend upon many factors including the variability of human speakers, the choice
of recognition categories, and the recognition algorithms themselves. Since this study's
primary goal is to modify the time sequence of speech, and not to create a totally new speech
recognition scheme, the errors that occur in automatic segmentation and labeling are fixed
manually before the time modification algorithms are invoked.

Figure 2-19.    Final segmentation and labeling for the word "wield."

a)  Time-domain waveform;

b)  Segment labels and segment durations (in number of samples);

c)  Segment boundary points (boundary sample number).

This section describes both the nature of the errors as well as a set of software
programs  with  a  graphical  user  interface  (GUI)  created  to  help  the  user  manually  edit  and 
correct  the  automatic  segmentation  and  labeling  results. 

2.7.1  Description of Errors

A  variety  of  different  types  of  errors  can  occur.  Segmentation  errors  result  when 
the  algorithms  pick  either  an  incorrect  number  of  segments,  or  incorrect  locations  for  the 
segment  boundaries.  A  labeling  error  results  when  the  algorithms  pick  the  wrong  label  for 
a  segment.  Figure  2-20  shows  examples  of  these  types  of  errors. 

In Figure 2-20c, the beginning of the unvoiced fricative is incorrectly detected as
starting at sample number 4014. Examination of the time waveform shows that the actual
beginning of the unvoiced fricative is closer to sample number 3300. This type of error
typically occurs with weak, unvoiced fricatives such as /f/, since the energy level of the /f/
is not much greater than the energy level of the background noise. Strong, unvoiced
fricatives such as /s/ and /ʃ/ do not exhibit this problem.

A  second  type  of  error  is  seen  in  Figure  2-20b.  The  vowel  is  divided  into  four  parts 
by  the  spectral  segmentation  algorithm.  While  this  is  not  a  problem  in  itself,  it  creates  the 
possibility  for  labeling  errors  by  requiring  that  the  four  parts  of  the  same  vowel  be  labeled 
separately. 

A third type of error is also seen in Figure 2-20b. An error occurs in the labeling
stage for the third part of the vowel. The third part is labeled as a nasal (N), instead of a
vowel  (V).  This  occurs  most  often  for  long,  voiced  segments  that  are  comprised  of 
nasalized  vowels  and/or  word-final  vowels. 

Figure 2-21 shows a different type of error caused by the spectral segmentation
algorithm for the word "veal" spoken by a male speaker. In this case, the boundary between
the /v/ and the /i/ is not detected. As a result, the combined /v/-/i/ segment is classified as
a vowel, since the vowel score has the greatest mean value for the segment. This typically
occurs for the weak voiced fricatives, namely /v/ and /th/.

Figure 2-20.    Final segmentation and labeling for the word "foo."

a)  Time-domain waveform;

b)  Segment labels and segment durations (in number of samples);

c)  Segment boundary points (boundary sample number).

Figure 2-21.    Spectral segmentation for the word "veal."

a)  Spectrogram;

b)  Frame association;

c)  Spectral boundaries.

Figure  2-22  also  shows  the  segmentation  and  labeling  results  for  the  word  "veal" 
spoken  by  a  male  speaker.  Note  that  the  final  portion  of  the  word  in  Figure  2-22b  is 
classified  as  an  unvoiced  stop.  Examination  of  the  spectrogram  in  Figure  2-21a  shows  that 
there  is  significant,  unvoiced,  low  frequency  noise  present  at  the  end  of  the  word.  Listening 
reveals  that  the  noise  is  caused  by  the  speaker  exhaling  after  completion  of  the  word. 
Therefore,  while  unexpected,  an  unvoiced  segment  does  exist  at  the  end  of  the  word,  and 
it  is  correctly  detected. 

2.7.2  Software  and  GUI  for  Manual  Modification 

A  set  of  programs  has  been  created  to  provide  the  user  with  a  convenient  method 
of  displaying  and  modifying  the  results  of  the  automatic  segmentation  and  labeling 
algorithms. 

Figures 2-23 and 2-24 show the two primary windows of the graphical user
interface for the modification programs. The Main window is shown in Figure 2-23. This
window allows the user to save or discard any modifications, select what is displayed in
the three graphs of the Display window, and open the three sub-windows for modification
of the segmentation and labeling (S&L) results. The Display window is shown in Figure
2-24. The window is divided into three graphs. The top graph displays either the original
or modified version of the time-domain waveform. Note that the modified version is
created whenever a silent segment is inserted into the word via the Insert Silent Segment
sub-window. The middle graph displays one of six choices: the original or modified
time-domain waveform, the original or modified segment type and duration (T&D) results,
or the original or modified segment boundaries. The third graph displays the same choices
as the second graph (independent of the second graph). The user selects the results
displayed in each of the three graphs via the three push-button menus in the "Display"
section of the Main window.

Figure 2-22.    Final segmentation and labeling for the word "veal."

a)  Time-domain waveform;

b)  Segment labels and segment durations (in number of samples);

c)  Segment boundary points (boundary sample number).

Figure 2-23.    Main window for manual modification of segmentation and
labeling results.

Figure 2-24.    Display window for manual modification of segmentation and
labeling results.

The three sub-windows are shown in Figures 2-25 through 2-27. They are invoked
by the three push-buttons in the bottom left portion of the Main window. The Insert Silent
Segment sub-window is shown in Figure 2-25. This allows the user to insert a silent
segment of variable length at the beginning of any of the segment boundaries. The silent
segment cannot be inserted into the middle of a segment. Note that this feature does not
correct the automatic S&L results, but rather gives the user additional flexibility in creating
test tokens. The Move Segment Boundaries sub-window is shown in Figure 2-26. This
allows the user to move any or all of the boundaries between the segments. The Fix Labels
sub-window allows the user to change any or all of the labels for the segments, and is shown
in Figure 2-27. In all three of the sub-windows, once a parameter is modified, an "Update"
button and a "Cancel" button appear (not shown). These allow the user to either select the
desired parameter(s) and modify the S&L results, or discard the parameter(s) without
modifying the S&L results and start over, if desired. The "OK" button closes the window.

The Insert Silent Segment sub-window is shown in Figure 2-25. The length of the
silent segment is controlled by the top slider, and the slider position is automatically
rounded to the closest 5 ms, since this is the frame length of a silent frame in the original
analysis. The user can adjust the length of the silent segment by moving the bar in the center
of the slider with a mouse, or by clicking on the left or right arrow at either end of the slider.
The rounded duration is displayed in the small box to the left of the slider. The insertion
point of the beginning of the silent segment is controlled by the bottom slider. The user
can adjust the insertion point by moving the bar in the center of the slider with a mouse,
or by clicking on the left or right arrow at either end of the slider. The insertion point and
the corresponding slider position are automatically rounded to the beginning sample
number of the closest segment boundary, and the starting sample number is displayed in
the small box to the left of the slider.

Figure 2-25.    Insert Silent Segment sub-window for manual modification of
segmentation and labeling results.

Figure 2-26.    Move Segment Boundaries sub-window for manual modification of
segmentation and labeling results.

Figure 2-27.    Fix Labels sub-window for manual modification of segmentation and
labeling results.

The Move Segment Boundaries sub-window is shown in Figure 2-26. The total
number  of  sliders  displayed  depends  upon  the  number  of  boundaries  in  the  given  word. 
There  is  one  slider  for  each  segment  boundary.  The  two  segments  that  surround  each 
boundary  are  listed  to  the  far  left  of  the  slider.  The  boundary  position  is  adjusted  by  moving 
the  bar  in  the  center  of  the  slider  with  a  mouse,  or  by  clicking  on  the  left  or  right  arrow  at 
either  end  of  the  slider.  The  slider  position  is  automatically  rounded  to  the  starting  point 
of  the  nearest  frame,  and  the  starting  point  is  displayed  in  the  small  box  to  the  left  of  the 
slider.  The  boundary  for  each  segment  is  automatically  limited  to  be  greater  than  the 
preceding  segment  boundary,  and  less  than  the  following  segment  boundary. 

The  Fix  Labels  sub-window  is  shown  in  Figure  2-27.  The  total  number  of 
push-button  menus  that  are  displayed  depends  upon  the  number  of  segments  in  the  given 
word.  There  is  one  push-button  menu  for  each  segment.  The  current  label  for  each  segment 
is displayed in the push-button. The label can be changed by pushing the push-button and
selecting  a  new  label  from  the  pop-up  menu. 

An additional feature of the software is the ability to combine or merge like,
adjacent segments into one larger segment of the same type. This is done by the "Merge
Seg's" button in the bottom right corner of the Main window. This could be used, for
example, to combine the two adjacent vowel segments in Figure 2-20. Note that once this
is done, the number of sliders and push-buttons in the sub-windows changes (since the total
number of segments changes), and the controls are updated and redrawn automatically.
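The merge operation itself is simple to express. The sketch below is only an illustration of the idea, not the code behind the "Merge Seg's" button: the function name and the (label, start sample, length) tuple format are assumptions, segments are assumed to be contiguous and in time order, and every run of like, adjacent segments is assumed to be merged in a single pass.

```python
def merge_adjacent(segments):
    """Merge runs of adjacent segments that carry the same label.

    segments -- list of (label, start_sample, length) tuples, assumed to be
                contiguous and sorted by start_sample
    """
    merged = []
    for label, start, length in segments:
        if merged and merged[-1][0] == label:
            prev_label, prev_start, prev_length = merged[-1]
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (prev_label, prev_start, prev_length + length)
        else:
            merged.append((label, start, length))
    return merged


# Toy example with two adjacent vowel segments.
toy_segments = [("si", 0, 100), ("V", 100, 50), ("V", 150, 70), ("N", 220, 30)]
print(merge_adjacent(toy_segments))
```

After a merge, the number of segments (and therefore the number of boundaries) decreases, which is why the sliders and push-button menus need to be redrawn.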

The user can use the Display window to compare the "before" and "after" results
while editing the S&L parameters. At any point, the user can discard all of the edits and
start over by pushing the "Discard Changes" button in the Main window. After editing is
finished, the user can save the new S&L results under a different name by selecting the
"Save Changes" push-button. When this button is pushed, a pop-up field appears in the
Main window (not shown) in which the user can type the new name under which the edited
parameters will be saved. This allows the original, unmodified S&L results to be saved
for further reference and editing, if desired.

CHAPTER 3
TIME MODIFICATION ALGORITHMS AND USER INTERFACE

The  time  modification  system  in  this  study  allows  the  user  to  selectively  modify  the 
durations  of  the  segments  that  comprise  the  speech  signal.  The  previous  chapter  describes 
the  portion  of  the  system  that  detects  and  labels  the  speech  segments,  and  this  chapter 
describes  the  portion  of  the  system  that  modifies  the  segment  durations  and  synthesizes  the 
resulting  time-modified  speech. 

The time modification system is controlled by user-specified parameters that allow
precise control over modification of the speech signal. For each segment, the user can
specify the time scale factor, the minimum desired duration, and the mapping method that
determines the portion of the segment to be altered. The parameter specification is
accomplished via a graphical user interface (GUI) program that is written in MATLAB and
runs on a Sun Microsystems workstation. Once the parameters are specified,
the time-modified speech is synthesized by an LPC speech synthesizer.

This  chapter  is  organized  as  follows:  First,  the  LPC  speech  synthesizer  is  briefly 
described.  Next,  an  introduction  of  the  basic  method  used  to  modify  the  speech  is  given, 
along  with  several  examples.  The  user-specified  parameters  are  then  discussed  in  detail. 
The  mapping  method  is  also  discussed.  The  time  modification  algorithm  is  then  presented. 
The  method  used  to  prevent  "glitches"  during  the  synthesis  process  is  described,  and  the 
GUI  controls  are  presented  in  detail. 

3.1  The Linear Prediction Coding (LPC) Speech Synthesizer

From Markel and Gray (1976), the LPC speech synthesizer is comprised of an
all-pole filter described in the z-domain by

$$ H_i(z) = \frac{1}{A_i(z)} \tag{3.1} $$

where Ai(z) for frame i of the speech signal is given by

$$ A_i(z) = a_{0_i} + a_{1_i} z^{-1} + a_{2_i} z^{-2} + \cdots + a_{N_i} z^{-N}, \qquad a_{0_i} = 1 \tag{3.2} $$

The  vector  Ai(z)  is  calculated  and  stored  for  each  analysis  frame.   The  value  N  =  13 
adequately  models  the  human  speech  production  system  (Hu,  1993). 

The  error  signal  obtained  during  the  LPC  analysis  is  termed  the  residue  signal,  r(n). 
The  residue  signal  is  used  to  excite  the  all-pole  filter  during  synthesis.  Let  Ri(n)  be  defined 
as  the  portion  of  the  residue  signal  obtained  during  the  analysis  of  frame  i.  For  example, 
if  frame  2  begins  at  sample  number  101  and  ends  at  sample  number  150,  then  R2(n)  is  the 
1  by  50  vector  given  by 

$$ R_2(n) = [\, r(101),\ r(102),\ r(103),\ \ldots,\ r(148),\ r(149),\ r(150) \,] \tag{3.3} $$

The input to the LPC speech synthesizer for frame i can then be described by the
ordered pair (Ai, Ri), where Ai is a 1 by N vector of LPC coefficients, Ri is a 1 by Mi vector
of the residue signal, and Mi is the length (in number of samples) of frame i. Note that Mi
is not constant in a pitch-synchronous synthesizer, since the frame size usually changes
from pitch period to pitch period.
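The way a sequence of (Ai, Ri) pairs drives the all-pole filter of Equation 3.1 can be illustrated with a direct-form recursion. The sketch below is written under stated assumptions and is not the synthesizer used in this study: the function name is hypothetical, each coefficient list is assumed to include the leading term a0 = 1, and the filter memory is assumed to carry over from one frame to the next.

```python
def synthesize(frames):
    """Excite a frame-varying all-pole filter 1/A_i(z) with the residue.

    frames -- iterable of (a, residue) pairs, where a = [1, a1, ..., aN]
              (leading a0 = 1 included, an assumption of this sketch) and
              residue is the list of residue samples R_i for that frame
    """
    output = []
    for a, residue in frames:
        order = len(a) - 1
        for r in residue:
            # s(n) = r(n) - sum_{k=1..N} a_k * s(n - k)
            s = r
            for k in range(1, order + 1):
                if len(output) >= k:
                    s -= a[k] * output[-k]
            output.append(s)
    return output


# Toy example: a first-order filter excited by a unit impulse, in two frames.
toy_frames = [([1.0, -0.5], [1.0, 0.0]), ([1.0, -0.5], [0.0, 0.0])]
print(synthesize(toy_frames))
```

Because each frame contributes exactly as many output samples as its residue vector contains, the length of the synthesized signal is the sum of the Mi values of the frames that are sent to the synthesizer.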

3.2  Time  Modification  Basics— Frame  Skipping  and  Frame  Doubling 

The basic time modification method used in this study involves either elimination
(for compression) or doubling (for expansion) of the (Ai, Ri) ordered pairs before they are
sent to the LPC synthesizer. This is best illustrated by the following examples. The
examples are also depicted in Figure 3-1. To synthesize the original speech token without
time modification, the ordered pairs (Ai, Ri) are sent to the LPC speech synthesizer for
i = {1, 2, 3, ..., L-2, L-1, L}, where L is the total number of frames in the original
speech signal. As a result, each frame is synthesized once. This is depicted in Figure 3-1a.
To synthesize the token at approximately twice the original speaking rate (one-half the
original duration) the ordered pairs (Ai, Ri) are sent to the synthesizer for
i = {1, 3, 5, ..., L-4, L-2, L}. This method skips every other frame during the
synthesis process. This is depicted in Figure 3-1b. The term approximately is used since
the pitch period (and therefore duration) of each frame is not constant. To synthesize the
token at one-half the original speaking rate (twice the original duration) the ordered pairs
(Ai, Ri), where i = {1, 1, 2, 2, 3, 3, ..., L-2, L-2, L-1, L-1, L, L}, are sent to
the synthesizer. This method synthesizes every frame twice, and is depicted in Figure 3-1c.
The three resulting speech tokens created by these methods for the word "meat" are shown
in Figure 3-2. Note that in each of these example tokens, the silent segment preceding the
unmodified word (from sample number 0 to sample number 3300) is not modified. This
is done only for demonstration purposes to preserve the alignment of the beginnings of the
three synthesized speech tokens in the three graphs.

Figure 3-1.      Examples of time modification using an LPC speech synthesizer.

a)  Normal rate: no time modification;

b)  Fast rate: approximately one-half the original duration;

c)  Slow rate: twice the original duration.
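The three frame sequences in these examples can be generated with a few lines of Python. This is a minimal sketch with hypothetical function names; the frame lengths in the toy example are invented values used only to show why the fast-rate duration is approximately, rather than exactly, one-half of the original.

```python
def normal_rate(num_frames):
    """Frame indices 1..L, each frame synthesized once."""
    return list(range(1, num_frames + 1))


def fast_rate(num_frames):
    """Every other frame: 1, 3, 5, ... (ends at L when L is odd)."""
    return list(range(1, num_frames + 1, 2))


def slow_rate(num_frames):
    """Every frame twice: 1, 1, 2, 2, ..., L, L."""
    return [i for i in range(1, num_frames + 1) for _ in range(2)]


def duration(indices, frame_lengths):
    """Total length in samples; frame_lengths[i - 1] is M_i for frame i."""
    return sum(frame_lengths[i - 1] for i in indices)


# Toy example: seven pitch-synchronous frames of unequal length (made up).
toy_lengths = [50, 52, 48, 51, 49, 50, 53]
for sequence in (normal_rate(7), fast_rate(7), slow_rate(7)):
    print(sequence, duration(sequence, toy_lengths))
```

Doubling every frame exactly doubles the duration, whereas skipping every other frame gives a duration that depends on which frame lengths happen to be kept.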

Although these examples are simple, they demonstrate the basic method of time
modification of speech used in this study. This method involves the manipulation of the
sequence of (Ai, Ri) ordered pairs used as inputs by the LPC speech synthesizer. Note
however,  that  these  examples  do  not  demonstrate  selective  time  modification,  since  they 
control  the  information  that  is  removed  or  doubled  in  a  trivial  manner.  Selective  time 
modification  is  accomplished  by  exercising  greater  control  (than  the  previous  examples) 
over  the  ordered  pairs  used  for  synthesis.  In  this  study,  the  selection  of  the  ordered  pairs 
is  based  on  multiple  user-specified  parameters  that  are  based  on  the  phonemic  content  of 
the  speech  token. 
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Figure  3-2.      Examples  of  time-modified  speech. 

a)  Normal  rate:  no  time  modification; 

b)  Fast  rate:  approximately  one-half  the  original  duration; 

c) Slow rate: twice the original duration.

3.3 User-Specified Modification Parameters


The  user-specified  time  modification  parameters  are  based  on  the  eight  segment 
types  defined  in  Section  2.1.  Each  segment  type  (vowel,  nasal,  unvoiced  fricative,  etc.) 
has  two,  global,  user-specified  parameters.  The  first  parameter  is  the  duration  scale  factor 
(SF). It is expressed as a real number with a resolution of 0.01. There are a total of eight
SF parameters, with one per segment type, i.e. SFvowel, SFnasal, etc. The SF parameter
specifies the desired ratio of the final segment duration to the original segment duration.
For example, if SFvowel = 0.33, the duration of the vowel segment(s) in the time-modified
word is approximately 33% of the duration of the corresponding vowel segment(s) in the
original word. The term approximately is used since the resolution of the algorithm is
controlled by the frame size. The frame is the smallest unit that can be added or removed
in the algorithm, and the duration of a discrete number of frames may not exactly equal 33%
of the original duration.

The  second  global  parameter  is  the  minimum  segment  duration  (MD).  It  specifies 
the minimum duration of the resulting, time-modified segment. It is expressed in
milliseconds. There are a total of eight MD parameters, one per segment type, i.e.
MDvowel, MDnasal, etc. It is important to note that this parameter can override the desired
final duration calculated from the segment's SF parameter. For example, if
SFvowel = 0.0, then the desired final duration of each vowel segment in the
time-modified word is, initially, zero ms. However, if MDvowel is greater than zero, the
algorithm automatically adjusts SFvowel so that the final duration of each vowel segment
is as close as possible to MDvowel. Note that the final duration may not be equal to MDvowel
since the resolution is equal to the duration of a single frame, as discussed in the previous
paragraph.

The  manual  scale  factor  (MSF)  is  a  third,  user-specified  parameter.  It  is  not  global 
in scope, and is not associated with one particular type of segment. It is expressed as a real
number, with a resolution of 0.01. The MSF parameter is used to override the SF parameter
for a given segment. A separate MSF value is specified for each segment in the speech
token. For example, in the word "man," the initial /m/, the /a/, and the final /n/ each have
a unique MSF parameter. The default MSF value is unity, and the MSF parameter must
also be "activated" for each segment if it is to be used. The default state is "inactive." In
addition, the MD parameter can override the desired final duration calculated from the
MSF parameter in the same manner that the MD parameter can override the desired final
duration calculated from the SF parameter.

The  MSF  parameter  allows  a  single  occurrence  of  a  particular  type  of  segment  to 
be modified independently in words that have multiple occurrences of the segment type.
An example of how the MSF parameter is used is as follows: Suppose, from the previous
example, that the word "man" is being modified. If SFnasal = 0.50, then both the initial
and final nasals are modified during synthesis. If the user only wants the initial nasal to
be modified, he/she activates the MSF parameter for segment 3 (the final /n/), and sets its
value to 1.00. Since the activated MSF parameter for segment 3 overrides any global SF
parameter, the final nasal /n/ has a scale factor of 1.00, which results in no time modification
of  the  segment. 
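
The precedence among the SF, MSF, and MD parameters can be summarized with a short Python sketch. It is an illustration under stated assumptions, not the software used in this study: the function name and arguments are hypothetical, the durations in the example are made up, and MD is assumed to act as a simple lower bound on the desired duration (the text only describes the case in which the scale factor would shorten the segment below MD).

```python
def desired_duration_ms(original_ms, sf, md_ms, msf=1.0, msf_active=False):
    """Desired final duration of one segment, before rounding to whole frames.

    original_ms : duration of the unmodified segment
    sf          : global scale factor for the segment's type
    md_ms       : global minimum duration for the segment's type
    msf         : manual scale factor for this particular segment
    msf_active  : True if the MSF parameter has been activated
    """
    # An activated MSF overrides the global SF for this segment only.
    scale = msf if msf_active else sf
    # Assumption: MD is applied as a floor on the scaled duration.
    return max(scale * original_ms, md_ms)


if __name__ == "__main__":
    # Word "man" with SFnasal = 0.50; MSF = 1.00 is activated for the final /n/.
    print(desired_duration_ms(100.0, sf=0.50, md_ms=10.0))  # initial /m/
    print(desired_duration_ms(100.0, sf=0.50, md_ms=10.0,
                              msf=1.00, msf_active=True))   # final /n/
```

The value returned is only the target; the actual duration is then approximated with a whole number of frames.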

3.4  Mapping 

Once  the  desired  final  durations  are  specified,  the  algorithm  must  determine  the 
frames of a given segment that will be removed or doubled. In the simple examples of
Section 3.2, every other frame was removed to achieve compression, and every frame was
doubled to achieve expansion. However, these techniques do not offer the flexibility of
modifying  specific  portions  of  a  given  segment  (i.e.  selective  modification). 

This system uses a method that allows the user to assign a weighting function, or
"map," to each segment. Each frame in the segment is assigned a weight between zero and
one. During synthesis, the frames with the lowest weights are eliminated (if SF < 1.00),
and the frames with the highest weights are doubled (if SF > 1.00). Obviously, if SF (or
MSF) for the segment is 1.00, the weight is trivial, since frames are neither eliminated nor
doubled.

The  user  selects  one  of  eight  weighting  maps  for  each  segment.  Five  of  the  maps 
are  fixed  (the  "Fixed  maps"),  and  the  other  three  can  be  edited  by  the  user  (the  "User 
maps").  The  maps  are  shown  in  Figures  3-3  and  3-4.  The  Random  (Fixed)  map  shown 
in  Figure  3-3a  arbitrarily  assigns  a  random  weight  to  each  frame  in  the  segment.  The 
Fixed_1 map in Figure 3-3b emphasizes the beginning of the segment. The Fixed_2 map
in Figure 3-3c emphasizes the end of the segment. The Fixed_3 map in Figure 3-4a
emphasizes the end points of the segment. The Fixed_4 map shown in Figure 3-4b
emphasizes the middle of the segment. One of the three user maps (the User_1 map) is
shown in Figure 3-4c. The weighting function shown is arbitrary and can be edited by the
user. The three User maps are identical in function; the only difference between them is
that they can each contain a different weighting curve. As a result, the remaining two User
maps (User_2 and User_3) are not shown. The three user maps are saved for each speech
token. As a result, each token has its own, unique, mapping functions. For example, the
User_1 map for the word "meat" is not necessarily the same as the User_1 map for the word
"sue."


Figure 3-3. User-selectable mapping functions.
a) Random;
b) Fixed_1;
c) Fixed_2.

Figure 3-4. User-selectable mapping functions.
a) Fixed_3;
b) Fixed_4;
c) User_1.

Each map is stored as a 1 by 100 vector. Since few segments have exactly 100
frames, an interpolation process determines the actual weight for each frame in the
segment. The process first maps the weighting map onto the segment of interest. This is
done by creating a temporary map (of length 1 by N samples) by

temp_map(i) = map(j),   1 ≤ i ≤ N,   j = ceil{100 i / N}                    (3.4)

where map(j) is the original 1 by 100 weighting map, N is the length of the segment in
samples, and ceil{ } is a function which rounds the argument up to the closest integer. Once
the temporary map is calculated, the sample number that resides at the center of each frame
is then determined. The weight for each frame of the interpolated map is the value of
temp_map(i) for the center sample of each frame.
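
The interpolation can be illustrated with the following Python sketch. It assumes the index mapping j = ceil{100 i / N} for Equation (3.4) and assumes that the center sample of a frame of odd or even length is found by rounding half the frame length up; the function name interpolate_map and its arguments are hypothetical, and the code is not the MATLAB implementation used in this study.

```python
def interpolate_map(weight_map, frame_lengths):
    """Return one weight per frame of a segment.

    weight_map    : the stored 1 by 100 weighting map (list of 100 values)
    frame_lengths : number of samples in each frame of the segment
    """
    n = sum(frame_lengths)  # N, the segment length in samples
    # temp_map(i) = map(j) with j = ceil(100 * i / N), for 1 <= i <= N.
    # (100 * i + n - 1) // n is an exact integer form of the ceiling.
    temp_map = [weight_map[(100 * i + n - 1) // n - 1] for i in range(1, n + 1)]

    weights = []
    start = 0  # number of samples that precede the current frame
    for length in frame_lengths:
        center = start + (length + 1) // 2  # 1-based center sample (assumed rounding)
        weights.append(temp_map[center - 1])
        start += length
    return weights


if __name__ == "__main__":
    ramp = [j / 100 for j in range(1, 101)]  # a simple rising test map
    print(interpolate_map(ramp, [80, 75, 82, 90]))
```

Each frame receives exactly one weight, regardless of how many samples it contains.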

An  example  of  the  map  interpolation  is  shown  in  Figure  3-5  for  the  vowel  segment 
of  the  word  "wield"  spoken  by  a  male  speaker.  The  Fixed_3  map  is  shown,  after 
interpolation,  in  Figure  3-5b.  Note  that  each  pitch  period  (i.e.  frame)  in  Figure  3-5a  has 
exactly  one  corresponding  weight  in  Figure  3-5b. 

3.5  Time  Modification  and  Synthesis 

The  time  modification  and  synthesis  processes  are  described  in  the  following 
paragraphs.  A  block  diagram  of  the  time  modification  algorithm  is  shown  in  Figure  3-6. 
The  algorithm  begins  by  creating  a  1  by  M  vector,  Fsave,  of  frame  indices  where 

Fsave  =  [  1,    2,    3,    ...   M-2,    M-1,    M  ]  (3.5) 

and  M  is  the  total  number  of  frames  in  the  original,  unmodified  word.  Once  the  scale 
factors  and  minimum  durations  are  specified,  the  algorithm  calculates  the  desired  final 
duration  for  each  segment.  The  process  then  enters  an  iterative  loop  for  each  segment:  If 
frames  are  to  be  cut  in  a  particular  segment,  the  algorithm  examines  the  interpolated  map 
for  the  segment,  and  removes  the  frame  index  from  Fsave  that  corresponds  to  the  frame  in 
the segment with the lowest weight. As a result, Fsave becomes a 1 by M-1 vector. If frames
are to be added (instead of cut), the frame index with the highest weight is duplicated, and
Fsave becomes a 1 by M+1 vector. The duration of the total (new) number of frames in the
segment is then calculated (this is denoted as Durationtemp in Figure 3-6), and compared
to the desired final duration. The algorithm continues in the loop, and removes (or adds)
one frame index from Fsave during each pass. Once the desired final duration is reached
for the segment, the process exits the loop. The algorithm repeats the loop for each segment
that is modified. The final result is Fsave, which contains only the indices of the frames that
are used to synthesize the time-modified speech.

Figure 3-5. Interpolated map for the vowel segment of the word "wield."
a) Time-domain waveform;
b) Interpolated weighting map (Fixed_3 map).

Figure 3-6. Block diagram of the time modification algorithm.

Note  that  the  algorithm  cuts  (or  adds)  the  number  of  frames  that  causes  the  actual 
final  duration  for  the  segment  to  be  as  close  as  possible  to  the  desired  final  duration. 
However,  these  two  durations  are  not  always  the  same.  Tests  show  that  the  actual  scale 
factor  (calculated  as  the  ratio  of  the  duration  of  the  modified  segment  to  the  duration  of  the 
corresponding  unmodified  segment)  usually  only  differs  by  one  to  two  percentage  points 
from  the  desired  scale  factor.  The  tests  also  show  that  the  actual  segment  duration  is  within 
about  2.5  ms  of  the  desired  segment  duration  for  unvoiced  segments,  and  about  5.0  ms  of 
the  desired  segment  duration  for  voiced  segments. 
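
The per-segment loop can be sketched in Python as follows. This is a simplified illustration, not the MATLAB program used in this study. The function name and arguments are hypothetical, and two details are assumptions: the loop stops as soon as one more change would not bring the duration closer to the target, and during expansion the frames are duplicated in order of decreasing weight (cycling through the ranking again if more frames are needed).

```python
def modify_segment(frames, weights, durations_ms, target_ms):
    """Return the sorted frame indices used to synthesize one segment.

    frames       : original frame indices of the segment (unique)
    weights      : interpolated map weight of each frame
    durations_ms : duration of each frame, in ms (all positive)
    target_ms    : desired final duration of the segment
    """
    weight = dict(zip(frames, weights))
    duration = dict(zip(frames, durations_ms))
    kept = list(frames)
    total = sum(durations_ms)
    cutting = target_ms < total
    # Lowest weights first when cutting, highest weights first when adding.
    ranked = sorted(frames, key=weight.get, reverse=not cutting)

    step = 0
    while ranked:
        if cutting and step == len(ranked):
            break  # every frame has already been removed
        frame = ranked[step % len(ranked)]
        new_total = total - duration[frame] if cutting else total + duration[frame]
        # Stop when one more change would not move closer to the target.
        if abs(new_total - target_ms) >= abs(total - target_ms):
            break
        if cutting:
            kept.remove(frame)
        else:
            kept.append(frame)
        total = new_total
        step += 1
    return sorted(kept)


if __name__ == "__main__":
    frames = [10, 11, 12, 13]
    weights = [0.9, 0.1, 0.2, 0.8]
    durations = [5.0, 5.0, 5.0, 5.0]
    print(modify_segment(frames, weights, durations, target_ms=10.0))
    print(modify_segment(frames, weights, durations, target_ms=30.0))
```

Because whole frames are removed or added, the final duration generally differs from the target by up to about half of one frame duration, which is consistent with the tolerances reported above.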

During synthesis, Fsave controls the frames to be synthesized. Before synthesis,
Fsave is numerically sorted to guarantee that the elements occur in ascending order. The
synthesizer then reads one frame index from Fsave for each frame of speech that is
synthesized.  It  uses  the  corresponding  (Ai,  Ri)  ordered  pair  to  obtain  the  excitation  signal 
and  filter  coefficients. 

The  synthesizer  architecture  is  a  single,  13th-order,  Direct-I  type  filter.  The  typical 
cascade  of  multiple,  2nd-order  sections  is  not  used.  Since  all  of  the  calculations  are  done 
in  MATLAB,  the  filter  architecture  does  not  exhibit  significant  numerical  precision 
problems  (i.e.  instability  due  to  truncated  coefficients). 

3.6  Glitch  Prevention 

It  is  important  to  ensure  that  the  synthesizer  does  not  create  artificial 
discontinuities,  or  "glitches,"  in  the  output  signal.  These  glitches  can  occur  at  frame 
boundaries  as  the  result  of  discontinuities  in  the  Fsave  vector  when  frames  are  removed. 
The  discontinuities  can  also  result  from  frames  being  doubled,  although  the  effects  are  not 
as  severe  as  those  caused  by  frame  elimination.  Frame  removal  can  result  in  discontinuities 
in the final excitation (residue) signal, as well as large, abrupt changes in the frequency
response of the LPC filter. In general, in order to prevent glitches, the contents of the 13
filter taps are analyzed at the frame boundaries. The contents of the filter taps are either
preserved or modified, depending upon the specific case.

In  the  first  case,  if  the  indices  of  the  frames  on  either  side  of  the  boundary  are 
sequential  with  no  missing  frames,  the  filter  contents  are  not  modified.  Thus,  the  final 
conditions of the filter after the last sample of the previous frame are the initial conditions
of the filter for the first sample of the current frame. For example, if Fsave = [1, 2], then
the  final  conditions  for  frame  1  are  the  initial  conditions  for  frame  2. 

In  the  second  case,  if  the  indices  of  the  frames  on  either  side  of  the  boundary  are 
not  sequential  and  have  doubled  frames,  the  filter  contents  are  not  modified.  Thus,  the  final 
conditions  of  the  filter  after  the  last  sample  of  the  previous  frame  are  the  initial  conditions 
of  the  filter  for  the  first  sample  of  the  current  frame.  For  example,  this  applies  if 
Fsave  =  [1, 1, 1].  Note,  however,  that  the  excitation  signal  is  modified  according  to  a 
method  developed  by  Hu  (1993).  This  modification  is  done  in  two  stages.  In  both  stages, 
it  is  assumed  that  the  previous  frame  index  is  the  same  as  the  current  frame  index.  In  the 
first stage, the amplitude of the excitation signal associated with the current frame is
multiplied by a scale factor. This is done to simulate shimmer. The scale factor is constant
over the duration of the frame, but changes randomly from frame to frame. The scale factor
has a uniform distribution over the range [0.975, 1.025]. In the second stage, if the doubled
frame is unvoiced, the excitation signal for the current frame is replaced by a zero-mean,
white noise sequence that has the same length as the excitation signal. The amplitude of
the white-noise excitation signal is scaled to have the same root-mean-square (RMS) value
as the original excitation signal for the frame. This is done by Hu (1993) to prevent audible
artifacts, such as "warble" or "phasing" effects that occur during time expansion of
unvoiced segments. In theory, this is not required, since the excitation (i.e. the LPC
residue) for unvoiced frames is a white-noise sequence. Therefore, the act of repeating the
sequence should not create the artifacts described. However, due to the relatively short
duration of an unvoiced frame (50 samples), the residue for a single frame typically does
not exhibit a flat frequency spectrum, and the act of repeating the "non-white" sequence
may indeed cause artifacts to be perceived, due to the possibility of periodicity existing in
the resulting duplicated excitation sequences.
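
The two-stage excitation modification for a repeated frame can be sketched as follows. This Python fragment is an illustration only and is not the code of Hu (1993) or of this study. The function names are hypothetical, the noise is assumed to be Gaussian (the text specifies only that it is zero-mean and white), and the RMS target is taken literally as that of the original, unscaled excitation.

```python
import math
import random


def rms(signal):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def modify_repeated_excitation(excitation, voiced, rng=random):
    """Excitation for a frame whose index repeats the previous frame index."""
    if voiced:
        # Stage 1: one random gain per frame, uniform on [0.975, 1.025],
        # held constant over the frame to simulate shimmer.
        gain = rng.uniform(0.975, 1.025)
        return [gain * v for v in excitation]

    # Stage 2 (unvoiced frames): replace the excitation with zero-mean
    # noise of the same length, scaled to the RMS of the original frame.
    noise = [rng.gauss(0.0, 1.0) for _ in excitation]
    mean = sum(noise) / len(noise)
    noise = [v - mean for v in noise]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return [0.0 for _ in excitation]  # degenerate case, e.g. one sample
    scale = rms(excitation) / noise_rms
    return [scale * v for v in noise]


if __name__ == "__main__":
    frame = [math.sin(0.3 * n) for n in range(50)]  # stand-in for an LPC residue
    out = modify_repeated_excitation(frame, voiced=False, rng=random.Random(1))
    print(round(rms(frame), 6), round(rms(out), 6))
```

Since the replacement noise is drawn fresh for each repeated unvoiced frame, consecutive copies no longer share the same sample sequence, which removes the periodicity described above.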

In  the  third  case,  if  the  indices  of  the  frames  on  either  side  of  the  boundary  are  not 
sequential  and  one  or  more  frames  are  missing,  a  special  rule  is  invoked.  The  filter 
calculates  the  initial  conditions  that  would  exist  if  the  single,  final  missing  frame  was 
present,  and  uses  these  filter  states  as  the  initial  conditions  for  the  synthesized  frame 
immediately following the missing (un-synthesized) frames. For example, if
Fsave = [1, 2, 9], the filter calculates the final conditions that would be present if frame 8
were synthesized after frame 2, and uses these as the initial conditions for frame 9. The
motivation behind this rule is as follows: The discontinuities are almost always a result of
the energy that is stored in the digital filter being greatly amplified by the new LPC
coefficients immediately after these coefficients are updated. This occurs most often when
frames are removed from an unvoiced-voiced transition region. Since the gain of the LPC
filter is not normalized during the initial analysis to have a constant, average frequency
response from frame to frame, the average gain of the LPC filter for voiced regions is much
greater than the average gain of the filter during unvoiced regions. Thus, if no glitch
prevention is done, the stored energy in the digital filter during synthesis of an unvoiced
frame is greatly amplified when the LPC coefficients are abruptly updated to those of a
voiced frame.
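
The handling of the filter state at frame boundaries can be illustrated with the Python sketch below. It is not the synthesizer used in this study. The function names and the frame data layout are hypothetical, the sign convention y[n] = e[n] - sum of a[k] y[n-k] is an assumption, and the excitation modification applied to repeated frames is omitted for brevity.

```python
def synthesize_frame(excitation, coeffs, state):
    """All-pole filter for one frame: y[n] = e[n] - sum_k a[k] * y[n-k].

    state holds the most recent outputs (newest first) and is updated in
    place, so the final conditions of one frame become the initial
    conditions of the next frame that is filtered.
    """
    out = []
    for e in excitation:
        y = e - sum(a * s for a, s in zip(coeffs, state))
        state.insert(0, y)
        state.pop()
        out.append(y)
    return out


def synthesize(fsave, frames):
    """fsave: sorted frame indices; frames: index -> (coeffs, excitation)."""
    order = len(next(iter(frames.values()))[0])
    state = [0.0] * order
    output = []
    prev = None
    for idx in fsave:
        if prev is not None and idx - prev > 1:
            # Third case: frames are missing. Filter the last missing frame
            # (idx - 1) only to obtain its final conditions; discard the output.
            coeffs, excitation = frames[idx - 1]
            synthesize_frame(excitation, coeffs, state)
        # First and second cases (sequential or repeated index):
        # the filter state carries over unchanged.
        coeffs, excitation = frames[idx]
        output.extend(synthesize_frame(excitation, coeffs, state))
        prev = idx
    return output


if __name__ == "__main__":
    # Toy first-order frames; real frames would hold 13 LPC coefficients.
    frames = {i: ([-0.05 * i], [1.0, 0.0, 0.0]) for i in range(1, 10)}
    print(synthesize([1, 2, 9], frames))
```

For Fsave = [1, 2, 9], the sketch filters frame 8 silently after frame 2, so that frame 9 starts from the state it would have seen if its immediate predecessor had been synthesized.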

An  example  of  this  is  shown  in  Figure  3-7  for  the  transition  from  the  /s/  to  the  /u/ 
in  the  word  "sue"  spoken  by  a  male  speaker.  Figure  3-7a  shows  the  original,  unmodified 
word. Figure 3-7b shows the synthesized word where the final 50% of the /s/ is removed,
with no glitch prevention. Note that the filter creates a large glitch that dominates the output
for several glottal cycles. Figure 3-7c shows the synthesized word where the final 50%
of the /s/ is removed, using the glitch prevention rule described above. There is no
noticeable discontinuity at the transition, and the transition region closely resembles that
of the original word (ignoring, obviously, the portion of the /s/ that has been removed).

Figure 3-7. Glitch prevention for the unvoiced-voiced transition in the word "sue."
a) Original unmodified waveform;
b) Synthesized with final 50% of /s/ removed and glitch prevention off;
c) Synthesized with final 50% of /s/ removed and glitch prevention on.

3.7 Graphical User Interface (GUI)

A  graphical  user  interface  has  been  developed  that  offers  the  user  a  convenient, 
user-friendly  method  of  creating  time-modified  speech  tokens,  using  both  the 
segmentation and labeling results and the user-specified time modification parameters as
inputs. This section discusses the structure and controls of the time modification GUI. The
algorithms that comprise the GUI are written in MATLAB, and run on a Sun
Microsystems  UNIX  workstation. 

Before starting the time modification programs, the user must define the name of
the unmodified speech token (which is also the base name of both the LPC analysis results
and the segmentation and labeling results) as the MATLAB variable "name." After the time
modification program starts, this signal name is displayed in the top, left-hand corner of
the Main GUI window. In order to load a new signal for modification, the user must quit
the program, declare the new signal name as the MATLAB variable "name," and restart
the  program. 

The  following  sections  describe  the  windows  and  user-controls  that  comprise  the 
time modification GUI. The accompanying figures illustrate these windows and
demonstrate the procedure typically followed to modify a speech token. In the figures, the
process of modifying the nasal portion of the word "meat" is used as an example.

3.7.1  Main  Window 

The  Main  window  of  the  time  modification  GUI  is  shown  in  Figure  3-8.  This 
window serves as a "master control panel" for the time modification and synthesis
processes.

Figure 3-8. Main window for time modification GUI.

Figure 3-9. Preview window for time modification GUI.

The Functions section of the Main window invokes the various steps required in
the  time  modification  and  synthesis  processes.  The  buttons  are  arranged  in  a  left-to-right 
sequence that symbolically displays the order typically followed during modification of a
speech token. The "Save" button saves all of the user-specified parameters to hard disk for
the token. These values are also loaded as the default user-specified parameters the next
time the program is started for that particular token. This allows the user to resume the
modification process with the user-specified parameter values that were last used for a
given token. If the user-specified parameters are not available from hard disk (i.e. the user
is modifying a token for the first time), the program uses a default set of initial values for
the user-specified parameter values. The "Map" button invokes the program that creates
and stores (to hard disk) the final Fsave vector for use in the synthesis process. This program
is described in Section 3.5. The tabulated results of the program, in terms of the initial and
final segment durations and scale factors, are displayed in the MATLAB shell window (not
shown). The "Synth" button invokes the program that synthesizes the time-modified
speech using both the LPC analysis results and the user-specified parameters as inputs.
After synthesis, the user typically selects one of two options for audio playback of the
time-modified speech. The first option, or the "top branch" in the figure, plays the speech
through the workstation's digital-to-analog (D/A) hardware. The second option, or the
"bottom branch" in the figure, is comprised of two push-buttons. The "Syn2ESPS" button
first converts the synthetic speech from a MATLAB binary format to an Entropic Research
Laboratories (ERL) binary format. The "Save ESPS" button then saves the ERL-formatted
synthetic speech to hard disk. When this button is pushed, a pop-up, text-editor field
appears (not shown) above the "Save ESPS" button in which the user types the name of
the file to be saved. A "Cancel ESPS" button (not shown) also appears. After the user types
the file name, he/she selects either the "Save ESPS" or the "Cancel ESPS" button to save
the file or cancel the save process, respectively. Note that for the bottom branch, the user
must use external ERL software in order to play the synthetic speech.


The Parameter Windows section of the Main window contains six push-buttons that
invoke  the  six  windows  that  allow  the  user  to  both  display  the  results  and  specify  the  time 
modification  parameters.  The  buttons  are  arranged  in  a  left-to-right  sequence  that 
symbolically displays the order typically followed in the time modification process. The
six  windows  are  described  in  the  following  sub-sections. 

3.7.2  Preview  Window 

The  "Preview"  button  in  the  Main  window  opens  the  Preview  window  shown  in 
Figure 3-9. The window displays the segmentation and labeling results. The top graph
displays the original, unmodified speech. The middle graph displays the segment labels
and durations (in number of samples). The segment labels si, V, SV, N, US, UF, VB, and
VF correspond to silence, vowel, semivowel, nasal, unvoiced stop, unvoiced fricative,
voice bar, and voiced fricative, respectively. The bottom graph displays the segment
boundaries, i.e. the beginning sample number of each segment. The "OK" button closes
the  window. 

3.7.3  Scale  Factors  Window 

The  "SF's"  button  in  the  Main  window  opens  the  Scale  Factors  window  shown  in 
Figure  3-10.  The  window  contains  several  push-buttons  at  the  top  of  the  window,  and  eight 
label  /  display  /  slider  "combinations"  arranged  in  rows.  Each  row  represents  one  of  the 
eight segment types. The sliders as viewed from top to bottom control SFvowel, SFnasal,
SFsemivowel, SFvoicebar, SFvoicedfricative, SFunvoicedstop, SFunvoicedfricative, and SFsilent,
respectively. For each row, the segment type is displayed to the far left. The desired scale
factor (SF) is adjusted by moving the bar in the center of the slider with a mouse, or by
clicking the left or right arrow at either end of the slider. The slider position is automatically
rounded to the nearest one-hundredth, and the rounded scale factor is displayed in the small
box to the left of the slider. The range of values for SF is limited to [0, 3.0]. Note, however,
that this can be changed by modifying one line in the software.

Figure 3-10. Scale Factors window for time modification GUI.

The displayed SF values and slider positions are loaded from hard disk the first time
that the window is opened. If the user modifies any of the SF parameters, both an "Update"
and a "Cancel" button appear (not shown) next to the "OK" button. In order to accept and
save the changes, the user must press the "Update" button. If the user presses the "Cancel"
button, the slider positions and SF values are reset to their original values. The "Unity"
button at the top of the window sets all of the SF values to unity. This resets the time
modification process to a "no-modification state." The "OK" button closes the window
without  saving  the  displayed  values. 

3.7.4  Minimum  Durations  Window 

The  "MD's"  button  in  the  Main  window  opens  the  Minimum  Durations  window 
shown  in  Figure  3-11.  The  window  contains  several  push-buttons  at  the  top  of  the  window, 
and  eight  label  /  display  /  slider  combinations  arranged  in  rows.  Each  row  represents  one 
of the eight segment types. The sliders as viewed from top to bottom control MDvowel,
MDnasal, MDsemivowel, MDvoicebar, MDvoicedfricative, MDunvoicedstop, MDunvoicedfricative,
and MDsilent, respectively. For each row, the segment type is displayed to the far left. The
desired minimum duration (MD) is adjusted by moving the bar in the center of the slider
with  a  mouse,  or  by  clicking  the  left  or  right  arrow  at  either  end  of  the  slider.  The  slider 
position  is  automatically  rounded  to  the  nearest  millisecond,  and  the  rounded  minimum 
duration  is  displayed  in  the  small  box  to  the  left  of  the  slider.  The  range  of  values  for  MD 
is  limited  to  [0, 250]  ms.  Note,  however,  that  this  can  be  changed  by  modifying  one  line 
in  the  software. 

The  displayed  MD  values  and  slider  positions  are  loaded  from  hard  disk  the  first 
time that the window is opened. If the user modifies any of the MD parameters, both an
"Update" and a "Cancel" button appear (not shown) next to the "OK" button. In order to
accept and save the changes, the user must press the "Update" button. If the user presses
the  "Cancel"  button,  the  slider  positions  and  MD  values  are  reset  to  their  original  values. 
The  "Defaults"  button  at  the  top  of  the  window  sets  all  of  the  MD  values  to  a  predefined 
value,  which  is  10  ms.  This  default  value  can  be  changed  by  modifying  one  line  in  the 
software.  The  "OK"  button  closes  the  window  without  saving  the  displayed  values. 

3.7.5  Manual  Scale  Factors  Window 

The  "Man  SF's"  button  in  the  Main  window  opens  the  Manual  Scale  Factors 
window  shown  in  Figure  3-12.  The  window  contains  several  push-buttons  at  the  top  of 
the  window,  and  a  variable  number  of  push-button  /  display  /  slider  combinations  arranged 
in  rows.  Each  row  represents  one  segment  (of  any  type)  in  the  speech  token  time  sequence. 
The top row corresponds to the first non-silent segment, and the bottom row corresponds
to the last non-silent segment in the speech token. It is assumed (and required) that the first
and last segments are always silent. As shown in Figure 3-12 for the example word "meat,"
the first row corresponds to the nasal /m/, which is segment number 2 (since segment 1 is
always silent and, by convention, cannot be modified), and the last row corresponds to the
unvoiced stop /t/, which is segment 5.

For  each  row,  both  the  index  (i.e.  segment  number)  and  segment  type  are  displayed 
within  the  push-button.  There  is  also  a  diamond-shaped  box  inside  the  push-button.  If  the 
box  is  filled,  the  push-button  is  "on,"  and  the  MSF  parameter  is  active  for  the  segment 
If  the  box  is  not  filled,  the  push-button  is  "off,"  and  the  MSF  parameter  is  inactive  for  the 
segment.  In  addition,  if  the  MSF  parameter  is  active,  a  slider  and  corresponding  numerical 
display  box  are  displayed  to  the  right  of  the  push-button.  These  adjust  and  display  the  value 
of  the  MSF  parameter  for  the  segment.  The  MSF  value  is  adjusted  by  moving  the  bar  in 
the  center  of  the  slider  with  a  mouse,  or  by  clicking  the  left  or  right  arrow  at  either  end  of 
the  slider.  The  slider  position  is  automatically  rounded  to  the  nearest  one-hundredth,  and 
the rounded value is displayed in the small box to the left of the slider. The range of values
for MSF is limited to [0, 2.50]. Note, however, that this can be changed by modifying one
line in the software.

Figure 3-12. Manual Scale Factors window for time modification GUI.

Figure 3-13. Map window for time modification GUI.
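
The slider behavior described above can be sketched in a few lines. This is a minimal illustration in Python, not the MATLAB code of the original system; the names MsfRow, snap_msf, and apply_defaults are hypothetical, and the two example rows are illustrative only. It shows the clamping to [0, 2.50], the rounding to the nearest one-hundredth, and a "Defaults" action that resets only the active MSF values to unity.

```python
from dataclasses import dataclass

# Range stated in the text; the text notes it can be changed in one line.
MSF_MIN, MSF_MAX = 0.0, 2.50


@dataclass
class MsfRow:
    """One row of the window: a segment and its manual scale factor (hypothetical model)."""
    index: int       # segment number in the token
    seg_type: str    # segment label, e.g. "nasal"
    active: bool     # state of the diamond-shaped box
    value: float     # MSF value shown next to the slider


def snap_msf(raw_position: float) -> float:
    """Clamp a raw slider position to the MSF range and round to 0.01."""
    clamped = min(max(raw_position, MSF_MIN), MSF_MAX)
    return round(clamped, 2)


def apply_defaults(rows) -> None:
    """Set every active MSF value to unity; inactive rows are left unchanged."""
    for row in rows:
        if row.active:
            row.value = 1.0
```

For example, a raw slider position of 1.2349 would be displayed as 1.23, and a position beyond the right end of the slider would be held at 2.50.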

As  with  the  other  windows,  the  displayed  values  and  settings  are  loaded  from  hard 
disk  the  first  time  that  the  window  is  opened.  If  the  user  modifies  any  of  the  MSF 
parameters,  both  an  "Update"  and  a  "Cancel"  button  appear  (not  shown)  next  to  the  "OK" 
button.  In  order  to  accept  and  save  the  changes,  the  user  must  press  the  "Update"  button. 
If  the  user  presses  the  "Cancel"  button,  the  slider  positions,  settings,  and  MSF  values  are 
reset  to  their  original  values.  The  "Defaults"  button  at  the  top  of  the  window  sets  all  of  the 
active  MSF  values  to  unity.  It  does  not  modify  the  inactive  MSF  values.  The  "OK"  button 
closes  the  window  without  saving  the  displayed  values  or  settings. 

3.7.6  Map  Window 

The  "Maps"  button  in  the  Main  window  opens  the  Map  window  that  is  shown  in 
Figure  3-13.  Unlike  the  other  windows,  the  Map  window  is  the  top  window  in  a  hierarchy 
of  windows  that  are  all  related  to  the  user-defined  weighting  functions,  or  "maps."  The 
Map  window  is  divided  into  three  sections.  The  Display  section  controls  a  separate  window 
that  displays  either  (1)  one  or  all  of  the  speech  segments  and  the  associated  interpolated 
map(s),  or  (2)  one  of  the  eight  maps  (with  no  interpolation).  The  eight  maps  are  discussed 
in Section 3.4, and are: Random, Fixed_1, Fixed_2, Fixed_3, Fixed_4, User_1, User_2,
and User_3. The User Maps section of the Map window contains the "Edit" button which,
when pushed, opens a window that is used to edit and display any one of the three User maps
(i.e. User_1, User_2, or User_3). The Segment Mapping section of the Map window
contains a variable number of push-buttons that correspond to the time sequence of
segments in the speech token. There is one push-button per speech segment. The
push-buttons are arranged from left to right. The leftmost push-button corresponds to the
first non-silent segment in the token, and the rightmost push-button corresponds to the last
non-silent segment in the token. When pressed, the push-button displays a pop-up menu
of the eight map choices available for the segment. The current map choice for each
segment  is  displayed  on  the  face  of  each  push-button.  In  addition,  the  segment  index  and 
label,  or  "Type,"  are  displayed  above  each  of  the  push-buttons. 

3.7.6.1  Map  display  window 

The  Display  section  of  the  Map  window  in  Figure  3-13  contains  a  group  of  four 
mutually-exclusive  push-buttons  labeled  "Segment,"  "Global,"  "OFF,"  and  "Map,"  as  well 
as  a  single  push-button  located  below  the  "Map"  push-button,  whose  label  varies.  The  four 
mutually-exclusive  push-buttons  control  the  display  in  the  Map  Display  window,  and  are 
known collectively as the "display mode." When the "OFF" display mode is selected, the
Map  Display  window  is  closed.  When  the  "Segment"  display  mode  is  selected,  the  Map 
Display  window  displays  two  graphs,  as  shown  in  Figure  3-14.  The  top  graph  displays  the 
time-domain  waveform  of  the  segment,  and  the  bottom  graph  displays  the  user-selected, 
interpolated  weighting  map  associated  with  the  segment.  Both  graphs  are  updated  to 
display  the  next,  successive  segment  each  time  that  the  "Segment"  push-button  is  pressed. 
This  allows  the  user  to  scroll  through  the  segments  by  repeatedly  pressing  the  "Segment" 
push-button  until  the  segment  of  interest  is  displayed.  When  the  "Global"  display  mode 
is  selected,  the  Map  Display  window  displays  the  same  two  graphs  as  the  "Segment" 
display mode, except that the graphs span the entire duration of the speech token. This is
shown in Figure 3-15. It can be seen in this example that the word "meat" has four different
maps.  This  is  confirmed  by  the  settings  in  the  Segment  Mapping  section  of  the  Map 
window  shown  in  Figure  3-13  for  the  same  word.  When  the  "Map"  display  mode  is 
selected,  only  the  non-interpolated  map  is  displayed  in  the  Map  Display  window.  This  is 
shown  in  Figure  3-16.  The  specific  map  that  is  displayed  can  be  changed  by  the 
push-button (and associated pop-up menu that is not shown) directly under the "Map"
push-button in the Map window shown in Figure 3-13. Note in this example, and from
Figure 3-13, that the User_1 map is selected and shown in Figure 3-16.

Figure 3-14. Map Display window for time modification GUI. Segment display mode.

Figure 3-15. Map Display window for time modification GUI. Global display mode.

Figure 3-16. Map Display window for time modification GUI. Map display mode.
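
The distinction between a non-interpolated map (a fixed-length vector) and the interpolated map drawn under a segment can be illustrated with a short sketch. The Python below is a hedged example only: it assumes linear interpolation and assumes that the map is stretched to one value per point of the segment, neither of which is confirmed in this section. The function name interpolate_map is hypothetical.

```python
def interpolate_map(map_values, num_points):
    """Resample a fixed-length map onto num_points evenly spaced positions.

    Linear interpolation is an assumption made for this sketch.
    """
    if len(map_values) < 2 or num_points < 1:
        raise ValueError("need at least two map values and one output point")
    if num_points == 1:
        return [map_values[0]]
    last = len(map_values) - 1
    out = []
    for k in range(num_points):
        pos = k * last / (num_points - 1)   # fractional index into the map
        lo = min(int(pos), last - 1)        # left neighbor, kept in range
        frac = pos - lo
        out.append((1 - frac) * map_values[lo] + frac * map_values[lo + 1])
    return out
```

With this sketch, the same stored map can be drawn under a short segment or a long one; only the number of output points changes.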


Figure  3-17.    Map  Edit  window  for  time  modification  GUI.  Initial  Display. 



3.7.6.2  Map  Edit  window 

One  of  the  key  features  of  the  time-modification  system  is  the  ability  to  edit  the  User 
maps  in  order  to  create  customized  weighting  functions.  This  process  is  initiated  by  the 
"Edit"  push-button  in  the  User  Maps  section  of  the  Map  window.  When  the  "Edit" 
push-button  is  pressed,  a  separate  window  known  as  the  Map  Edit  window  appears,  as 
shown  in  Figure  3-17.  When  first  opened,  this  window  contains  four  push-buttons.  The 
user  then  either  selects  one  of  the  three  User  Maps  to  be  edited,  or  selects  "Cancel"  to  close 
the  Map  Edit  window  and  return  to  the  Map  window.  If  one  of  the  three  User  push-buttons 
is  selected,  the  four  push-buttons  disappear,  and  both  a  graph  and  a  new  combination  of 
push-buttons  are  displayed,  as  shown  in  Figure  3-18.  The  window  at  this  stage  is  still 
termed  the  Map  Edit  window. 

The  Map  Edit  window  shown  in  Figure  3-18  controls  the  User  Map  editing  process. 
The  graph  in  the  top  portion  of  the  window  displays  the  non-interpolated  map  (a  1  by  100 
vector)  as  well  as  a  number  of  moveable  "targets,"  which  are  depicted  by  small  circles  on 
the  graph.  The  bottom  section  contains  several  push-buttons.  The  "#  Targets"  push-button 
controls  the  number  of  targets  displayed  in  the  top  graph.  When  the  push-button  is  pressed, 
a  pop-up  menu  appears  (not  shown)  that  offers  the  user  a  choice  of  either  two,  three,  six, 
or  eleven  targets.  The  current  choice  is  displayed  on  the  face  of  the  push-button,  and  the 
corresponding  number  of  targets  are  drawn  on  the  graph.  If  two  targets  are  selected,  they 
are  located  at  0  and  100  (percent)  on  the  x-axis.  If  three  targets  are  selected,  they  are  located 
at 0, 50, and 100 (percent) on the x-axis. If six targets are selected, they are located at 0,
20, 40, 60, 80, and 100 (percent) on the x-axis. If eleven targets are selected, they are
located at 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 (percent) on the x-axis.

The  "Adjust"  push-button  allows  the  targets  to  be  moved  vertically  with  the 
workstation  mouse.  To  do  so,  the  user  first  presses  the  "Adjust"  button,  and  then  positions 
the cursor over the graph at the desired location for a given target. When the cursor is at
the desired position, the user presses the left mouse button, and the target moves to the new
position on the graph. Note that to allow repeatability, the target positions are fixed along
the x-axis, and quantized to the nearest one-tenth along the y-axis. Thus, each target can
have one of eleven values. If the user presses the left mouse button when the cursor is not
directly over one of the vertical lines associated with a target, the program determines the
closest target along the x-axis, and modifies this target. Once one target is moved, the
"Adjust" button must be pressed again in order to further modify a target position.

Figure 3-18. Map Edit window for time modification GUI.

Figure 3-19. Postview window for time modification GUI.

In  most  instances,  the  weighting  function  is  calculated  as  a  best-fit,  polynomial 
curve  (in  a  least-squares  sense)  that  is  fitted  to  the  targets.  The  process  of  fitting  the  curve 
to  the  targets  is  termed  "smoothing"  in  this  study.  The  order  of  the  polynomial  is  controlled 
by  the  "Smoothing"  push-button.  When  the  "Smoothing"  push-button  is  pressed,  a  pop-up 
menu appears (not shown) that displays the choices "None," "Linear," "Poly_2,"
"Poly_3," "Poly_4," and "Poly_5." The current choice is displayed on the face of the
push-button. If "None" is selected, a least-squares curve is not calculated. Instead, the
weighting function is created by drawing a straight line between each adjacent pair of
targets. An example of this is depicted in the graph in Figure 3-18. If "Linear" is selected,
the weighting function is a best-fit straight line, i.e. a first-order polynomial. If one of the
"Poly" choices is selected, the weighting function is a best-fit polynomial curve, with the
order being determined by the suffix of the "Poly" choice (i.e. "Poly_2," "Poly_3,"
"Poly_4," or "Poly_5").

Like  many  of  the  other  windows,  when  the  Map  Edit  window  first  opens,  the  values 
last  used  for  the  particular  token  are  displayed.  If  the  user  modifies  any  of  the  values  or 
settings,  an  "Update"  button  and  a  "Cancel"  button  appear  (not  shown)  next  to  the  "OK" 
button.  To  accept  the  new  settings,  the  user  presses  the  "Update"  button.  To  reject  the 
modifications and return to the previous settings, the user presses the "Cancel" button. The
"OK" button closes the window without saving any of the current values.
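
The target placement and smoothing options described in this section can be sketched as follows. This is an illustrative Python reconstruction, not the original MATLAB implementation. All function names are hypothetical. Three points are assumptions: the y-axis range of [0, 1] (inferred from the eleven values allowed by quantization to tenths), the placement of the 100 map points evenly between the first and last targets, and the rejection of a polynomial order larger than the number of targets minus one (the source does not say how that case is handled).

```python
def target_positions(num_targets):
    """Evenly spaced target x positions in percent (2, 3, 6, or 11 targets)."""
    if num_targets not in (2, 3, 6, 11):
        raise ValueError("the GUI offers 2, 3, 6, or 11 targets")
    step = 100 / (num_targets - 1)
    return [i * step for i in range(num_targets)]


def place_target(targets_y, click_x, click_y):
    """Move the target nearest click_x to click_y, quantized to tenths."""
    xs = target_positions(len(targets_y))
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - click_x))
    clamped = min(max(click_y, 0.0), 1.0)  # assumed y range of [0, 1]
    updated = list(targets_y)
    updated[nearest] = round(clamped * 10) / 10
    return updated


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def polyfit_ls(xs, ys, order):
    """Least-squares polynomial coefficients (lowest power first)."""
    powers = range(order + 1)
    ata = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    return solve_linear(ata, aty)


def build_map(targets_y, smoothing="None", length=100):
    """Build the 1-by-length weighting map from the target heights."""
    n = len(targets_y)
    xs = [i / (n - 1) for i in range(n)]              # targets scaled to [0, 1]
    grid = [k / (length - 1) for k in range(length)]  # assumed map sample points
    if smoothing == "None":
        out = []
        for g in grid:                                # straight lines between targets
            seg = min(int(g * (n - 1)), n - 2)
            t = g * (n - 1) - seg
            out.append((1 - t) * targets_y[seg] + t * targets_y[seg + 1])
        return out
    order = 1 if smoothing == "Linear" else int(smoothing.split("_")[1])
    if order > n - 1:                                 # assumption: reject this case
        raise ValueError("polynomial order exceeds number of targets minus one")
    coeffs = polyfit_ls(xs, targets_y, order)
    return [sum(c * g ** p for p, c in enumerate(coeffs)) for g in grid]
```

Solving the normal equations directly is adequate here because the order is at most five and the x values are scaled to [0, 1]; a production implementation would more likely call a library routine such as MATLAB's polyfit.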

3.7.7 Postview Window

The  "Postview"  button  in  the  Main  window  opens  the  Postview  window  shown  in 
Figure 3-19. The window displays four graphs in order to compare the time-modified
speech  with  the  unmodified  speech.  The  top  graph  displays  the  original,  unmodified, 
time-domain  waveform.  The  second  graph  displays  the  original  segment  labels  and 
durations  (in  number  of  samples).  The  third  graph  displays  how  many  times  each  frame 
of  the  unmodified  speech  token  is  used  to  synthesize  the  time-modified  speech.  This  is  used 
to  confirm  that  the  speech  signal  was  modified  as  originally  specified.  In  the  example  in 
Figure  3-19,  it  can  be  seen  from  the  third  graph  that  the  first  half  of  the  nasal  segment  is 
not  used  in  the  synthesis  of  the  time-modified  speech,  and  that  all  of  the  other  frames  are 
used  once.  The  fourth  graph  is  the  time-domain  waveform  of  the  time-modified  speech. 
Note  that  by  default,  the  time-modified  waveform  derives  its  MATLAB  variable  name  by 
prefixing  the  term  "syn_"  to  the  name  of  the  original  waveform. 

3.8  Summary 

This  chapter  describes  the  algorithms  and  the  graphical  user  interface  that  are  used 
to  modify  the  speech  signal. 

The  time  modification  system  is  based  upon  a  LPC  speech  synthesizer.  The  first 
stage of the system pitch-synchronously divides the speech signal into frames, and then
performs a LPC analysis on each frame. The result for each frame is the ordered pair
(A_i, R_i), where i is the frame index, A_i is a 1 by N vector of LPC coefficients, R_i is a 1 by
M_i vector of the residue signal, and M_i is the length (in number of samples) of the frame.
During synthesis, the ordered pairs are sent sequentially to the LPC speech synthesizer.
In order to speed up or slow down the rate of the speech, selected ordered pairs are either
removed from or duplicated in the sequence, respectively. The synthesizer also incorporates
a simple and effective algorithm to prevent discontinuities, or "glitches," from being
created in the output waveform.
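
The removal and duplication of ordered pairs can be illustrated with a per-frame usage count, which is the quantity plotted in the third graph of the Postview window. The Python sketch below is a simplified illustration under stated assumptions: the names build_frame_sequence and output_length are hypothetical, a frame is represented as a plain (A_i, R_i) tuple, and the glitch-prevention step of the synthesizer is not modeled.

```python
def build_frame_sequence(frames, usage_counts):
    """Return the synthesis sequence: each (A_i, R_i) pair repeated count times.

    A count of 0 removes the frame, 1 keeps it, and 2 duplicates it.
    """
    if len(frames) != len(usage_counts):
        raise ValueError("one usage count is required per frame")
    sequence = []
    for frame, count in zip(frames, usage_counts):
        if count < 0:
            raise ValueError("usage counts must be non-negative")
        sequence.extend([frame] * count)
    return sequence


def output_length(sequence):
    """Total number of residue samples, i.e. the sum of M_i over the sequence."""
    return sum(len(residue) for _, residue in sequence)
```

Because each frame carries its own residue length M_i, the duration of the time-modified token follows directly from the usage counts.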

The time modification algorithm independently modifies the durations of the
segments that comprise the speech signal. Examples of these segment types include
vowels, semivowels, unvoiced fricatives, unvoiced stops, etc. For each segment type, the
user  can  specify  the  scale  factor  (SF),  and  the  minimum  duration  (MD).  In  addition,  the 
user  can  specify  a  type-independent  manual  scale  factor  (MSF)  for  each  segment.  This 
allows  a  single  occurrence  of  a  specific  type  of  segment  to  be  modified  independent  of  all 
other  occurrences  of  the  same  type.  For  example,  the  word  "man"  is  comprised  of  three 
segments:  an  initial  nasal,  a  vowel,  and  a  final  nasal.  The  user  has  the  option  of  specifying 
a  separate  MSF  parameter  for  each  of  the  three  segments.  Thus,  the  initial  /m/  can  be 
modified independently of the final /n/, even though both are the same segment type,
namely  nasal. 

An important feature of the system is the ability to specify which frames of a
segment  are  removed  (or  doubled)  in  the  time  modification  process.  This  is  accomplished 
by  a  weighting  function,  or  "map,"  that  is  assigned  independently  to  each  segment  by  the 
user.  Several  fixed  maps  are  available,  and  the  user  can  also  edit  any  one  of  three  adjustable 
maps,  if  desired. 

The  time  modification  process  is  controlled  by  a  graphical  user  interface  that  is 
written  in  the  MATLAB  programming  language  and  implemented  on  a  Sun  Microsystems 
workstation.  The  GUI  is  comprised  of  multiple  windows  that  are  arranged  in  a  logical 
sequence  to  help  guide  the  user  through  the  time  modification  process.  The  user  adjusts 
the  values  of  the  time  modification  parameters  via  a  workstation  mouse  by  moving  and 
selecting  both  slide-bar  and  push-button  controls  that  are  displayed  in  the  various  GUI 
windows. 

CHAPTER 4
LISTENING TESTS


This  chapter  describes  the  listening  tests  that  were  performed.  The  goals  of  these 
tests  were  to  (1)  study  the  ability  of  the  time  modification  system  to  create  high  quality 
speech  test  tokens,  (2)  examine  the  role  of  duration  in  the  perception  of  the  fricatives  /s/ 
and  /z/  in  word-initial  position  in  single-syllable  words,  and  (3)  measure  the  effects  of 
modifying  only  specific  portions  (i.e.  the  beginning,  middle,  or  end)  of  these  fricatives. 
Both  informal  listening  tests  (pilot  studies)  and  a  formal  listening  test  were  performed. 

The  organization  of  this  chapter  is  as  follows:  First,  a  discussion  of  word-length 
versus  sentence-length  test  tokens  is  given.  The  pilot  studies  are  then  presented.  Next,  the 
factors  that  affected  the  development  of  the  formal  listening  test  are  discussed,  including 
the  selection  of  both  the  test  tokens  and  the  test  format,  and  the  selection  and  screening  of 
listeners.  The  chapter  concludes  by  presenting  the  results  of  the  formal  listening  test. 

4.1  Word-Length  versus  Sentence-Length  Test  Tokens 

In any listening test, a choice must be made regarding the duration, or "scope," of
the  test  tokens  that  are  used.  The  choice  is  to  use  tokens  that  are  either  sentences 
(sentence-length  tokens)  or  single  words  (word-length  tokens).  This  choice  depends,  at 
least  in  part,  upon  which  type  of  test  is  being  performed:  psychological  or  phonological  (see 
Chapter  1).  In  general,  psychological  tests  use  sentence-length  (or  longer)  tokens,  while 
phonological  tests  use  word-length  (or  shorter)  tokens. 

This  study  performs  phonological  listening  tests  using  word-length  test  tokens. 
The  reasons  for  this  choice  are  as  follows:  First,  as  discussed  in  Chapter  1,  the  results  of 
phonological  tests  are  more  closely  related  to  the  acoustic  features  of  the  speech  signal  than 
are the results of psychological tests. Since the time modification system in this study is
based upon the acoustic features of the speech signal, phonological tests better demonstrate
the performance and flexibility of the system. Second, since word-length tokens typically
have fewer speech segments than sentence-length tokens, the word-length tokens are easier
to modify and test in a controlled manner. This is because the number of possible
time-modified variations of a sentence-length token far exceeds the number of possible
time-modified  variations  of  a  word-length  token.  Third,  there  is  significant  contextual 
information  contained  in  sentences  that  can  substantially  influence  a  listener's  perception, 
and  as  a  result,  can  adversely  influence  test  results.  This  information  includes  the 
constraints  imposed  by  grammar  and  vocabulary  (Voiers,  1983).  Contextual  information 
is  important  because  it  can  enable  a  listener  to  determine,  or  reconstruct,  the  identity  of  an 
unintelligible  word  in  a  sentence  based  upon  the  words  and  phrases  that  surround  the 
unintelligible  word.  False  conclusions  can  be  reached  if  contextual  information  is  present 
in listening tests, since the intelligibility of a sentence is dependent not only upon the
intelligibility  of  the  individual  words,  but  also  upon  the  listener's  higher-level  cognitive 
processes  that  utilize  contextual  information  in  an  attempt  to  make  the  sentence  "make 
sense." 

Once  word-length  tokens  are  chosen,  a  decision  must  be  made  to  use  either 
single-syllable or multi-syllable words. This study uses single-syllable words primarily
for the reason that single-syllable words are easier to modify and test in a controlled manner
than are multi-syllable words. This is because single-syllable words are typically
comprised of fewer speech segments than are multi-syllable words. As a result, there are
fewer possible variations of the resulting time-modified tokens for single-syllable words
than there are for multi-syllable words.

Researchers  have  developed  several  standardized  tests  that  use 
specially-constructed,  monosyllabic  word  lists  in  order  to  test  the  intelligibility  of  speech 
(ANSI S3.2-1989, 1989; Egan, 1948; Fairbanks, 1958; Voiers, 1983). The three tests most
often used are the Phonetically Balanced Word List Test, the Modified Rhyme Test, and
the Diagnostic Rhyme Test. These tests, however, were created specifically to measure
speech intelligibility over communications channels. As a result, they each contain a
relatively large number of test tokens. Due to the large number of time-modified variations
of each token that are created for this listening test, none of these three tests is suitable
for use in this study. This is because an extremely large number (i.e. thousands) of final
test tokens would be generated if any of the three standardized tests were used in its
entirety. Note, however, that portions of the Diagnostic Rhyme Test (DRT) were used in
pilot  studies  to  investigate  the  capabilities  of  the  time  modification  system,  as  well  as  to 
help  choose  the  final  word  list  for  the  formal  listening  test. 

The  next  section  describes  the  pilot  studies  that  were  performed.  These  pilot  studies 
utilized  the  time  modification  system  described  in  Chapter  2  and  Chapter  3  to  create 
numerous time-modified versions of the DRT. Later sections describe, in detail, the
processes that were followed to select both the test tokens and the test format for the formal
listening test.

4.2 Pilot Studies

Pilot  studies  were  performed  informally  to  test  the  capabilities  of  the  time 
modification  system  and  to  help  choose  the  test  tokens  for  the  formal  listening  test. 

In  order  to  test  an  adequate  quantity  and  variety  of  phonemes,  three  copies  of  the 
Diagnostic  Rhyme  Test  (DRT)  were  purchased  from  the  Dynastat  Corporation,  Austin, 
Texas.  The  word  lists  were  spoken  by  one  female  and  two  male  speakers.  A  copy  of  the 
DRT  word  list  is  given  in  Appendix  A.  The  DRT  contains  192  words;  therefore,  a  total  of 
576  unmodified  word  tokens  were  available  for  the  pilot  studies.  The  words  were  digitized 
from a digital audio tape (DAT) source with a 16-bit, Ariel, Pro-Port Model 656
analog-to-digital (A/D) converter. The sampling frequency was 10 kHz. The A/D
converter was connected to an Ariel DSP-32C digital signal processing board that was
installed in a Sun Microsystems IPX workstation.

The  three  listeners  who  participated  in  the  pilot  studies  were  all  from  the  University 
of  Florida.  The  first  was  a  Ph.D.  Candidate  in  Electrical  Engineering,  the  second  was  a 
Professor  of  Electrical  Engineering,  and  the  third  was  a  Professor  of  Communication 
Processes  and  Disorders.  All  were  male,  and  all  reported  having  no  known  hearing 
disorders. 

The  time-modified  test  tokens  were  created  using  the  time  modification  system 
developed  in  this  study.  The  tokens  were  converted  from  a  sampled-data  waveform  to  an 
analog  signal  using  a  16-bit,  Ariel,  Pro-Port  Model  656  digital-to-analog  (D/A)  converter. 
The  sampling  frequency  was  10  kHz.  The  D/A  converter  was  connected  to  an  Ariel 
DSP-32C  digital  signal  processing  board  that  was  installed  in  a  Sun  Microsystems  IPX 
workstation.  The  test  tokens  were  presented  over  Sony  MDR-CD888  headphones  at  a 
comfortable  listening  level  in  a  quiet  room. 

The  pilot  studies  were  grouped  according  to  the  speech  segment  categories  that 
were  investigated  (i.e.  vowel,  nasal,  etc.).  For  example,  one  set  of  tests  examined  the 
effects  of  removing  various  portions  of  the  initial  nasal  segment  from  a  subset  of  the  DRT 
that  contained  words  with  nasals  in  the  word-initial  position.  Another  test  examined  the 
effects  of  removing  different  portions  of  the  initial  stop  from  a  subset  of  the  DRT  that 
contained  words  with  stops  in  the  word-initial  position.  Note  that  the  word  pairs  in  the  DRT 
differ  only  in  the  initial  consonant.  Therefore,  in  the  majority  of  the  pilot  studies,  only  the 
duration  of  the  initial  consonant  was  varied.  This  helped  to  narrow  the  focus  of  the  research 
by  reducing  the  number  of  ways  that  the  words  were  modified  and  tested. 

A  description  of  the  specific  pilot  studies  and  a  discussion  of  the  results  of  these 
experiments  are  included  in  the  following  sub-sections.  Note  that  while  the  tests  were 
conducted  informally,  the  results  were  related  to  the  results  of  other  formal  studies,  when 
applicable. This was done to verify that the time modification system performed as
expected. 

4.2.1  Quality 

In  all  of  the  pilot  studies,  particular  attention  was  paid  to  the  overall  "quality"  of 
the  resulting  synthesized,  time-modified  speech  (note  that  the  definition  of  quality  in  this 
study  denotes  "naturalness"  or  "lifelikeness").  After  each  test  was  conducted,  the  listeners 
were  asked  if  they  had  detected  any  artifacts  or  distortion  in  the  synthesized  tokens.  In  all 
cases,  the  listeners  replied  that  the  synthesized  speech  was  indistinguishable  from  the 
natural  speech,  insofar  as  the  quality  of  the  synthesized  tokens  was  concerned.  Obviously, 
when  significant  portions  of  the  words  were  removed,  the  time-modified  tokens  were 
distinguishable  from  the  original  speech,  but  this  was  attributed  by  the  listeners  to  the 
changes  in  duration  and  not  to  the  quality  of  the  time-modified  speech. 

The  high  quality  of  the  synthesized  tokens  was  also  confirmed,  in  part,  by 
examination  of  both  the  time-domain  waveforms  and  the  spectrograms  of  the  original  and 
time-modified  speech  signals.  The  examination  revealed  no  noticeable  discontinuities  or 
"glitches"  in  the  time-domain  waveforms,  and  no  noticeable  changes  in  the  frequency 
distributions  between  the  original  and  time-modified  spectrograms  (i.e.  the  formant 
bandwidths  remained  unchanged,  as  did  the  overall  spectral  tilt). 

4.2.2  Nasals 

The  first  pilot  study  examined  the  effect  of  removing  various  portions  of  the  initial 
nasal  in  CV  and  CVC  words  from  the  DRT.  The  results  showed  that  the  nasals  were  robust, 
and that they could still be perceived correctly after they were reduced to approximately
25% of their original duration. This result was attributed to the coarticulation effects
exhibited by the nasals that often caused the following vowels to be nasalized. In addition,
for a given time-modified nasal duration, the results were the same regardless of the portion
of  the  nasal  that  was  removed  (i.e.  beginning,  middle,  or  end). 

For  word-initial  nasals,  perception  of  the  place  of  articulation  has  been  attributed 
to  the  formant  transition  from  the  nasal  to  the  following  vowel  (Cooper  et  al.,  1952; 
Malecot,  1956).  This  was  investigated  in  a  second  pilot  study  that  removed  both  a  portion 
of  the  nasal  consonant,  as  well  as  the  entire  formant  transition  region  from  the  following 
vowel.  Note  that  the  formant  transition  typically  occurred  during  the  first  five  to  ten  pitch 
periods  of  the  vowel.  The  test  tokens  for  this  experiment  were  the  same  set  of  words  that 
were  used  in  the  first  pilot  study. 

The  results  depended  upon  the  particular  vowel  as  well  as  the  duration  of  the  nasal. 
When  the  entire  formant  transition  region  was  removed  from  the  vowel,  correct  perception 
of  the  nasal  depended  upon  the  nasal  duration.  If  the  nasal  duration  was  relatively  short 
(less than about 50% of the original duration), it was difficult to correctly identify the place
of articulation. If the nasal duration was relatively long (greater than about 50% of the
original duration), the place of articulation was easier to detect.

These  results  were  also  affected  by  the  identity  of  the  vowel.  When  the  entire 
formant  transition  was  removed  from  the  vowel,  and  part  of  the  nasal  was  also  removed, 
the nasal /m/ in the word "meat," for example, was slightly more robust than the nasal /m/
in the word "mad." This finding that the perception of nasals was, in part,
dependent  upon  the  identity  of  the  following  vowel  is  consistent  with  the  findings  of  Ali, 
Gallagher,  Goldstein,  and  Daniloff  (1971). 

4.2.3  Stops 

The  third  pilot  study  examined  the  effect  of  removing  various  portions  of  initial 
stop  consonants  in  CV  and  CVC  words  from  the  DRT.  The  results  showed  that  the  stops 
were  not  robust  to  changes  in  duration.  The  perceived  identity  of  the  initial  stop  changed 
dramatically as the duration was reduced. Note that in this pilot study, quantitative
measurements of the durations that were required to bring about changes in perception of
the initial stops were not recorded. Instead, only the general trend (i.e. the sequence of
different phonemes that was perceived) was recorded as the duration of the initial stop was
reduced from 100% to 0% of its original value. Several examples illustrate this change in
phoneme perception as the duration was reduced: In the word "tense," perception of the
initial consonant changed from /t/ to /p/, and then from /p/ to /b/. In the word "tint,"
perception of the initial consonant changed directly from /t/ to /b/. In the word "peen,"
perception  of  the  initial  consonant  changed  from  /p/  to  /b/.  In  the  word  "coat,"  perception 
of  the  initial  consonant  changed  from  /k/  to  /p/,  and  then  from  /p/  to  /b/.  Note  that  in  all 
of  these  examples,  the  perceived  phoneme  category  changed  from  an  unvoiced  stop  to  a 
voiced  stop  as  the  duration  of  the  initial  consonant  was  reduced  from  100%  to  0%  of  its 
original  value.  This  result  supports  the  theory  that  one  of  the  factors  that  influences  the 
voiced/unvoiced  distinction  of  stops  is  the  duration  of  the  aspiration  noise  and  the 
subsequent  effect  upon  the  voicing  onset  time,  VOT  (Lisker  and  Abramson,  1964). 

An  interesting  result  of  this  pilot  study  was  that  perception  also  depended  upon  the 
portion  of  the  initial  consonant  that  was  preserved  (i.e.  the  beginning,  middle,  or  end).  In 
the  examples  described  above,  the  frames  of  the  initial  consonant  segment  that  were 
eliminated  were  selected  randomly  using  the  Random  map  described  in  Section  3.4.  Note, 
however,  that  a  second  map  (the  Fixed_2  map)  was  also  used  to  investigate  the  effects  of 
preserving only the final portion of the initial stop consonant. Results showed that for a
fixed  consonant  duration,  the  relative  position  (in  the  time-domain)  of  the  portion  of  the 
consonant  that  was  removed  altered  perception  of  many  of  the  test  tokens. 

For  example,  the  word  "taunt"  was  used  to  create  two  time-modified  tokens  with 
the  same  initial  consonant  duration  (approximately  50%  of  the  unmodified  duration).  The 
first token preserved the final portion of the word-initial /t/ (using the Fixed_2 map), while
the second token preserved a random selection of 5 ms frames from the word-initial /t/
(using the Random map). For the token that preserved the final portion of the /t/, the
resulting time-modified consonant was perceived to be a /p/. For the token that preserved
random frames of the /t/, the resulting time-modified consonant was perceived to be a /t/.
These  same  results  were  also  observed  for  the  word  "tense." 

4.2.4  Fricatives 

The  fourth  pilot  study  examined  the  effect  of  removing  various  portions  of  initial 
fricatives  in  CV  and  CVC  words  from  the  DRT.  The  results  varied,  and  depended  upon 
the  identity  and  the  voicing  attribute  of  the  original  fricative.  As  in  the  third  pilot  study, 
quantitative  measurements  of  the  durations  that  were  required  to  bring  about  changes  in 
perception  of  the  fricatives  were  not  recorded.  Instead,  only  the  general  trend  (i.e.  the 
sequence  of  different  phonemes  that  was  perceived)  was  recorded  as  the  duration  was 
reduced from 100% to 0% of its original value. Several examples illustrate this change in
perception as the duration was reduced. The examples are divided into two groups
according to the original voicing attribute of the fricative. The unvoiced examples are
given first: In the word "feel," perception of the initial consonant changed from /f/ to /t/,
then from /t/ to /p/, and finally from /p/ to /b/. In the word "foal," perception of the
consonant changed from /f/ to /p/. In this example, the /f/ was relatively robust, and
perception changed from /f/ to /p/ only when the duration of the consonant was reduced
to less than approximately 10% of its original value. In the word "said," perception of the
consonant changed from /s/ to /t/, then from /t/ to /d/, and finally from /d/ to /b/. In the word
"sing," perception of the consonant changed from /s/ to /z/ (or to a concatenation of /t/ and
/z/, as reported by one listener), and then from /z/ to /d/. In both of these examples, the /s/
was robust, and perception changed to a phoneme other than /s/ only when the duration of
the consonant was reduced to less than approximately 25% of its original value. In the word
"thin," perception of the consonant did not change from the unvoiced /θ/ until the duration
was reduced to less than approximately 10% of its original value. At this value, the /θ/ was
perceived to completely disappear, and the word "in" was heard.

In general, these results show that as the duration of the initial unvoiced fricative
was  decreased,  the  resulting  perceived  phoneme  was  either  a  voiced  fricative  or  a  voiced 
stop.  This  implies  that  duration  affects  the  perceived  voicing  of  fricatives,  which  is  in 
agreement  with  the  results  of  other  studies  (Cole  and  Cooper,  1975;  Denes,  1955;  Grimm, 
1966;  Jongman,  1989;  Miller  and  Nicely,  1955). 

It  was  observed  that  when  the  durations  of  the  initial  voiced  fricative  consonants 
were  reduced,  these  fricatives  were  much  more  resistant  to  changes  in  perceived  identity 
than  were  the  unvoiced  fricatives.  Several  examples  demonstrate  this:  In  the  word  "vox," 
perception  of  the  initial  consonant  did  not  change  from  /v/  as  the  duration  of  the  consonant 
was reduced. This effect was also observed for both the /z/ in the word "zee," and for the
/ð/ in the word "than." For the word "zoo," the initial /z/ did not change to another
phoneme,  but  rather,  completely  disappeared  when  the  duration  was  reduced  to  less  than 
approximately  10%  of  its  original  value.  In  general,  these  results  are  inconsistent  with  the 
findings of Grimm (1966), who found that as the duration of an initial voiced fricative was
reduced,  perception  changed  from  a  voiced  fricative  to  a  voiced  stop. 

4.3  Development  of  the  Formal  Listening  Test 

The  formal  listening  test  was  performed  to  quantitatively  measure  the  effects  of 
duration  upon  the  perception  of  several  phonemes.  The  test  also  investigated  the  effects 
of  modifying  only  specific  portions  (i.e.  beginning,  middle,  or  end)  of  these  phonemes. 
During  development  of  the  formal  test,  decisions  were  made  regarding  (1)  the  speech 
tokens that were used, (2) the portions of the tokens that were modified, (3) the various
versions  of  the  modified  tokens  that  were  created,  (4)  the  number  of  listeners  that  were 
required,  and  (5)  the  requirements  and/or  screening  that  was  imposed  upon  the  listeners. 

This  section  presents  the  choices  that  were  made,  and  describes  the  factors  that 
influenced these decisions. Where applicable, the results of previous phonological studies
were examined to ensure that the scope of the formal listening test used in this study was
similar  to  that  of  previous  research. 

4.3.1  Test  Tokens 

This  section  discusses  the  tokens  that  were  used  in  the  formal  listening  test.  It 
details  the  steps  that  were  followed  in  the  selection  of  (1)  the  type  of  words  that  were 
modified,  (2)  the  durations  of  the  time-modified  tokens  that  were  studied,  and  (3)  the 
portions  of  the  words  that  were  modified.  It  also  discusses  the  limitations  of  the  time 
modification  system  used  to  synthesize  the  test  tokens. 

4.3.1.1  Type 

Table  4-1  gives  a  summary  of  formal  listening  tests  conducted  by  other  researchers, 
and  lists  the  type  of  tests  (in  terms  of  the  specific  test  tokens  that  were  used),  the 
modification  methods,  and  the  durations  of  the  test  tokens.  It  can  be  seen  from  this  table 
that  each  study  focused  on  only  one  or  two  phoneme  categories.  In  addition,  the  order  and 
occurrence of the phonemes in the test tokens were tightly controlled. For example, Cole
and  Cooper  (1975)  varied  the  duration  of  the  initial  portion  of  six  consonants  in  word-initial 
position  in  CV  syllables.  They  spliced  magnetic  tape  to  create  six  test  tokens  for  each 
syllable.  Each  test  token  had  a  different  consonant  duration.  The  "step  size,"  or  difference 
in  the  consonant  duration  between  two  successive,  time-modified  tokens  of  the  same 
syllable,  was  calculated  by  dividing  the  original,  unmodified  consonant  duration  into  six 
equal  parts.  Thus,  the  step  size  was  not  consistent  across  the  various  CV  syllables,  since 
each  original,  unmodified  syllable  had  a  different  length. 

In  the  present  study,  the  effects  of  varying  the  duration  of  both  voiced  and  unvoiced 
fricatives in word-initial position in CV and CVC syllables were investigated. To further
restrict the study, only the fricatives /s/ and /z/ were examined. The reasons for these
choices were as follows: First, this was similar in scope to many of the other research
studies. Second, this limited the number of final test tokens that had to be created and
tested. Since each word was modified approximately 70 times, it was important to limit
the number of unmodified test words. Third, the pilot studies showed that duration was
a principal factor in the perception of voicing of unvoiced fricatives, and was not a factor
in  the  perception  of  voiced  fricatives.  Thus,  a  hypothesis  existed  that  could  be  proved  or 
disproved.  Fourth,  since  several  previous  studies  examined  this  same  hypothesis,  there 
were  previous  results  that  could  be  compared  with  the  results  of  this  study. 

The  words  "sue,"  "zoo,"  "said,"  and  "zed"  were  chosen  from  the  DRT  as  the  four 
test  words  to  be  modified.  The  words  were  spoken  individually,  with  no  carrier  sentence, 
by  a  male  speaker.  Figure  4-1  shows  the  time-domain  waveforms,  and  Figure  4-2  shows 
the  wideband  spectrograms  for  these  four  unmodified  words.  The  "sue/zoo"  CV  pair 
incorporated a lip-rounded vowel, and the "said/zed" CVC pair incorporated a relatively
neutral vowel. Although the structures of the four test words were not the same (i.e. two
were CV and two were CVC), this did not adversely affect the results. This is because
neither the duration of the vowel nor the duration of the following consonant (in the CVC
words) was modified.

The  amplitudes  of  the  four  unmodified  words  were  scaled  before  the  test  tokens 
were  created  so  that  the  vowels  in  each  of  the  words  had  the  same  maximum 
root-mean-square  (RMS)  value  (within  ±  0.5  dB).  This  ensured  that  the  words  were  all 
presented  at  approximately  the  same  listening  level. 
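
The level-matching step described above can be sketched as follows. This is only an illustrative sketch: the frame length, the use of non-overlapping frames, and the function names are assumptions, since the text does not say how the maximum RMS value was measured.

```python
import math

def max_short_time_rms(samples, frame_len):
    """Largest RMS value over consecutive, non-overlapping frames.

    Assumes len(samples) >= frame_len (hypothetical helper, not from the text).
    """
    best = 0.0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        best = max(best, rms)
    return best

def gain_to_match(vowel, target_rms, frame_len=100):
    """Linear gain that brings the vowel's maximum RMS value to target_rms."""
    return target_rms / max_short_time_rms(vowel, frame_len)

def level_difference_db(rms_a, rms_b):
    """Level difference between two RMS values, in dB."""
    return 20.0 * math.log10(rms_a / rms_b)
```

After the gain is applied to the whole word, level_difference_db can be used to check that the vowel levels of two words agree to within the stated 0.5 dB tolerance.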

4.3.1.2  Duration 

Many of the studies in Table 4-1 created a set of test tokens by incrementing the
duration of the modified phoneme in 10 ms steps from 0% to 100% of the original duration.
This was also the method that was used in this study. The choice of 10 ms as the "step size"
was investigated during the pilot studies. In most cases, when the duration of a word-initial
phoneme was repeatedly decremented in 10 ms steps from 100% to 0% of the unmodified
duration, the listener was able to identify a gradual change in the identity of the phoneme.
The changes in perception were more obvious and sudden when the time-modified
consonant duration was very short (less than 60 ms) than they were at longer consonant
durations. Thus, the value of 10 ms was chosen as a compromise. It allowed both a
moderate number of tokens to be created for each word (since the total number of tokens
increased as the step size decreased), and also provided sufficient resolution for measuring
gradual changes in the identity of the time-modified phonemes.

Figure 4-1.  Time-domain waveforms of the original, unmodified test words.
a) Sue; b) Zoo; c) Said; d) Zed.

Figure 4-2.  Spectrograms of the original, unmodified test words.
a) Sue; b) Zoo; c) Said; d) Zed.

4.3.1.3  Position

An  additional  test  factor  observed  during  the  pilot  studies  was  that  for  a  given 
duration,  the  position  (i.e.  beginning,  middle,  or  end)  of  the  portion  of  the  phoneme  that 
was  preserved  had  a  significant  effect  upon  the  perception  of  the  identity  of  the 
time-modified  phoneme.  This  effect  was  not  studied  in  the  tests  listed  in  Table  4-1.  Due 
to  the  apparent  importance  of  this  factor  in  the  perception  process,  it  was  investigated  in 
this  study  by  the  following  procedure:  For  each  word,  and  for  each  consonant  duration, 
three  tokens  were  created:  The  first  token  preserved  the  initial  portion  of  the  initial 
consonant,  the  second  token  preserved  the  middle  portion  of  the  initial  consonant,  and  the 
third  token  preserved  the  final  portion  of  the  initial  consonant. 

An example of this is shown in Figure 4-3, where three test tokens were created
from the unmodified word "sue." In this example, the duration of the /s/ was modified to
be equal to 50 ms for each of the three tokens. Since only the duration of the initial
consonant was modified, the figure shows only the initial consonant and the transition to
the following vowel. In (b), Token 1 preserves the initial 50 ms of the original /s/. In (c),
Token 2 preserves the middle 50 ms of the original /s/. In (d), Token 3 preserves the final
50 ms of the original /s/.
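
The three preserved portions in this example can be located with a short calculation. The sketch below is illustrative only: it assumes that the "middle" portion is centered within the consonant, which the text does not state explicitly, and it ignores the frame-based time resolution of the synthesis system. The function name is hypothetical.

```python
def preserved_window(consonant_ms, keep_ms, position):
    """Return (start_ms, end_ms) of the preserved portion, measured from
    the start of the original consonant."""
    if keep_ms > consonant_ms:
        raise ValueError("cannot preserve more than the original duration")
    if position == "beginning":
        start = 0.0
    elif position == "middle":
        # Assumption: the preserved portion is centered in the consonant.
        start = (consonant_ms - keep_ms) / 2.0
    elif position == "end":
        start = consonant_ms - keep_ms
    else:
        raise ValueError("position must be 'beginning', 'middle', or 'end'")
    return (start, start + keep_ms)

# The /s/ in "sue" (241.4 ms, from the measurements reported in this chapter),
# shortened to 50 ms.
for position in ("beginning", "middle", "end"):
    print(position, preserved_window(241.4, 50.0, position))
```

Under the centering assumption, the middle token for "sue" would retain roughly the portion from 95.7 ms to 145.7 ms of the original /s/.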

Figure 4-3.  Test token generation example. Time-domain waveforms for the word
"sue." Only the initial consonant and a portion of the vowel are shown.
a) Original, unmodified word;
b) Token "1" with the initial 50 ms of the original /s/ preserved;
c) Token "2" with the middle 50 ms of the original /s/ preserved;
d) Token "3" with the final 50 ms of the original /s/ preserved.

Figure 4-4.  Test token generation example. Spectrograms for the word "sue." Only the
initial consonant and a portion of the vowel are shown.
a) Original, unmodified word;
b) Token "1" with the initial 50 ms of the original /s/ preserved;
c) Token "2" with the middle 50 ms of the original /s/ preserved;
d) Token "3" with the final 50 ms of the original /s/ preserved.

Figure 4-4 shows the wideband spectrograms for the time-domain waveforms
shown in Figure 4-3. The spectrograms were created using a Hamming window with a
duration of 8.0 ms. As a result, the frequency resolution is 125 Hz. The analysis window
was advanced 2.0 ms after each short-term Fourier transform, which corresponds to a
window overlap of 75%. These analysis parameters, as well as the fact that a relatively
short portion of the speech signal (350 ms) is depicted, account for the "coarseness" of the
spectrogram.
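
The analysis parameters quoted above follow from simple arithmetic, reproduced in the sketch below. The frequency resolution is taken as the reciprocal of the window duration, as in the text. The 10 kHz sampling frequency is the value reported for token playback in Section 4.3.2; applying it to the spectrogram analysis is an assumption, as is the function name.

```python
def spectrogram_parameters(window_ms, hop_ms, sample_rate_hz):
    """Derive frequency resolution, overlap, and sample counts."""
    return {
        # Reciprocal of the window duration (window_ms is in milliseconds).
        "frequency_resolution_hz": 1000.0 / window_ms,
        # Fraction of the window shared by successive analysis frames.
        "overlap_percent": 100.0 * (window_ms - hop_ms) / window_ms,
        "window_samples": round(window_ms * sample_rate_hz / 1000.0),
        "hop_samples": round(hop_ms * sample_rate_hz / 1000.0),
    }

params = spectrogram_parameters(window_ms=8.0, hop_ms=2.0, sample_rate_hz=10_000)
print(params)
```

With these inputs the sketch gives a resolution of 125 Hz and an overlap of 75%, matching the values stated above, with an 80-sample window advanced 20 samples at a time.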

Note that in Figure 4-3, the signal energy, or "envelope," of the /s/ varies for each
of the three time-modified tokens. In (b), the envelope rises slowly and decays quickly.
In (c), the envelope rises quickly and decays quickly. In (d), the envelope rises quickly and
decays slowly. It is hypothesized that these different envelopes are responsible for the
disagreement in the perceived identity of the initial consonant in the three tokens (the initial
consonant in (b) is typically perceived as the fricative /θ/, while the initial consonants in
(c) and (d) are both typically perceived as the stop /t/).

4.3.1.4  Synthesis  and  time  resolution 

Calculation  of  the  total  number  of  test  tokens  was  done  as  follows:  The  unmodified 
durations of the initial fricatives in the words "sue," "zoo," "said," and "zed," were 241.4,
236.9, 192.7, and 174.1 ms, respectively. Multiple tokens were created for each word by
varying the duration of the initial consonant in 10 ms intervals. For each duration, three
test tokens were created. The three tokens corresponded to the three positions (i.e.
beginning, middle, or end) of the portion of the consonant that was preserved. This is
summarized in Table 4-2. A total of 270 tokens were created.
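
The token count can be checked with a few lines of code. The sketch below takes the largest desired duration for each word directly from Table 4-2 rather than deriving it from the measured consonant durations; the variable and function names are illustrative.

```python
STEP_MS = 10
POSITIONS = ("beginning", "middle", "end")

# Largest desired consonant duration (ms) for each word, as listed in Table 4-2.
MAX_DESIRED_MS = {"sue": 240, "zoo": 240, "said": 200, "zed": 180}

def desired_durations(max_ms, step_ms=STEP_MS):
    """Desired durations from 0 ms up to and including max_ms."""
    return list(range(0, max_ms + step_ms, step_ms))

def count_tokens(max_desired_ms):
    """Number of tokens per word: one per (duration, position) pair."""
    return {
        word: len(desired_durations(max_ms)) * len(POSITIONS)
        for word, max_ms in max_desired_ms.items()
    }

counts = count_tokens(MAX_DESIRED_MS)
print(counts, sum(counts.values()))
```

The per-word counts are 75, 75, 63, and 57, which sum to the 270 tokens reported.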

There  is  an  important  issue  regarding  the  durations  of  the  test  tokens  that  must  be 
explained.  For  many  of  the  test  tokens,  the  duration  of  the  time-modified  consonant  was 
not  equal  to  an  exact  multiple  of  10  ms.  The  reason  for  this  is  related  to  the  time  resolution 
of  the  time  modification  system  that  was  developed  for  this  study.  Some  examples  help 
to  illustrate  this  point:  If  the  initial  consonant  was  voiced,  as  in  the  words  "zoo"  and  "zed," 
the time modification system preserved the integer number of voiced frames that resulted
in a consonant duration that was as close as possible to the desired duration (as stated
previously,  the  desired  duration  was  always  a  multiple  of  10  ms).  Since  the  pitch  period 
of  the  (voiced)  consonant  frames  varied,  and  was  typically  not  a  multiple  of  10  ms,  the 
resulting  duration  of  an  integer  number  of  voiced  frames  did  not  necessarily  equal  a 
multiple  of  10  ms. 

Table 4-2.  Summary of tokens for formal listening test.

Word   Unmodified Initial   Set of Desired Durations of     Total Number of   Total Number
       Consonant Duration   Time-Modified Initial           Time-Modified     of Tokens
       (ms)                 Consonant (ms)                  Durations

sue    241.4                { 0, 10, 20, 30, ... , 240 }    25                75
zoo    236.9                { 0, 10, 20, 30, ... , 240 }    25                75
said   192.7                { 0, 10, 20, 30, ... , 200 }    21                63
zed    174.1                { 0, 10, 20, 30, ... , 180 }    19                57

Likewise,  if  the  initial  consonant  was  unvoiced,  as  in  the  words  "said"  and  "sue," 
the  time  modification  system  preserved  the  integer  number  of  unvoiced  frames  that 
resulted  in  a  consonant  duration  that  was  as  close  as  possible  to  the  desired  duration.  In 
most  instances,  the  resulting  unvoiced  consonant  duration  was  equal  to  the  desired 
duration, since for the large majority of unvoiced frames, the frame duration was equal to
5  ms. 

However, due to the method used in the LPC analysis algorithm, two unvoiced
frames were often created, each with a duration of less than 5 ms. The reason for this is
as follows: The LPC analysis algorithm had to account for the fact that the unmodified
duration of the original, unvoiced segment was usually not equal to a multiple of 5 ms. As
a result, a single "leftover" unvoiced frame with a duration less than 5 ms was often created.
Since the duration of this leftover frame was often extremely short (less than 10 samples),
it was not possible to perform an LPC analysis on the frame. Therefore, the LPC analysis
algorithm actually created two leftover frames. This was done by combining the original
leftover frame with an adjacent, unvoiced frame (the adjacent frame had a duration of 5
ms). The combined frame was then divided in half to create two, equal-length, leftover
frames, each with a duration less than 5 ms. This guaranteed that the length of each of the
leftover frames was sufficient for an LPC analysis.
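
The framing rule described above can be sketched as follows. This is an illustration rather than the analysis code used in the study: it assumes a 10 kHz sampling frequency (so that a 5 ms frame is 50 samples), places the two leftover frames at the end of the segment, and lets the two halves differ by one sample when the combined length is odd. None of these details are specified in the text.

```python
def unvoiced_frame_lengths(segment_samples, frame_samples=50):
    """Split an unvoiced segment into frame lengths (in samples).

    Full frames are frame_samples long. A short leftover is merged with
    one adjacent full frame and the result is halved, giving two leftover
    frames that are each shorter than a full frame.
    """
    full, leftover = divmod(segment_samples, frame_samples)
    if leftover == 0:
        return [frame_samples] * full
    if full == 0:
        # Segment shorter than one frame: nothing to merge with.
        return [leftover]
    combined = frame_samples + leftover
    first = combined // 2
    second = combined - first  # differs from `first` by one sample if odd
    return [frame_samples] * (full - 1) + [first, second]

# Example: a 192.7 ms unvoiced segment at the assumed 10 kHz rate (1927 samples).
frames = unvoiced_frame_lengths(1927)
print(len(frames), frames[-2:], sum(frames))
```

Because the leftover is always shorter than one full frame, each of the two halves is guaranteed to be shorter than 5 ms but longer than 2.5 ms.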

As a result, if either of the leftover frames was included in the synthesis of a
time-modified  token,  the  duration  of  the  unvoiced  consonant  in  the  token  was  not  equal  to 
a  multiple  of  10  ms.  For  this  case,  the  token  with  a  consonant  duration  that  was  as  close 
as  possible  to  the  desired  duration  was  automatically  created. 
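
The selection of an integer number of frames can be sketched as a search over cumulative frame durations. The example is a simplified illustration: it considers only frames taken from the beginning of the consonant, and the frame durations used are hypothetical rather than measured values.

```python
def frames_closest_to(frame_ms, desired_ms):
    """Return (frame_count, total_ms) for the run of leading frames whose
    total duration is closest to desired_ms. Ties favor the shorter run."""
    best_count, best_total = 0, 0.0
    total = 0.0
    for count, duration in enumerate(frame_ms, start=1):
        total += duration
        if abs(total - desired_ms) < abs(best_total - desired_ms):
            best_count, best_total = count, total
    return best_count, best_total

# Hypothetical voiced frames, one pitch period each (ms).
voiced = [8.0, 8.5, 8.0, 8.5, 8.0, 8.5, 8.0]
print(frames_closest_to(voiced, 50.0))

# Unvoiced frames of 5 ms each reach a multiple of 10 ms exactly.
unvoiced = [5.0] * 10
print(frames_closest_to(unvoiced, 30.0))
```

In the voiced case the closest achievable duration is 49.5 ms rather than 50 ms, which illustrates why the actual consonant durations were not always exact multiples of 10 ms.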

Despite  the  fact  that  the  actual  durations  of  the  time-modified  consonants  were  not 
always  equal  to  a  multiple  of  10  ms,  the  differences  that  resulted  were  ignored  for  the 
following  reasons:  First,  the  difference  between  the  desired  and  actual  duration  of  the 
initial  consonant  was  usually  only  one  or  two  milliseconds.  For  the  test  tokens  that  had  a 
consonant  duration  greater  than  40  ms,  this  difference  was  less  than  5%.  For  the  test  tokens 
that had a consonant duration greater than 100 ms, this difference was less than 2%.
Second,  a  difference  of  only  one  or  two  milliseconds  has  been  determined  to  be 
unnoticeable  to  listeners  (Huggins,  1972).  Third,  when  compared  to  many  of  the  previous 
tests,  a  difference  of  approximately  2%  is  quite  small.  For  example,  the  tokens  that  were 
used  in  previous  studies  were  often  recorded  and  edited  on  one-quarter  inch  magnetic  tape 
at  a  speed  of  seven  and  one-half  inches  per  second.  A  standard  45-degree  splice  angle  was 
typically  used  to  "cross  fade"  at  the  point  where  a  portion  of  the  phoneme  was  removed, 
which  was  often  at  the  boundary  between  two  dissimilar  phonemes.  The  splicing  was  done 
at  an  angle  to  prevent  audible  discontinuities  from  being  created.  As  a  result  of  the  tape 
speed  and  the  angle  of  the  splice,  the  duration  of  the  portion  of  the  tape  that  contained  the 
splice was 33.3 ms. Thus, there was an ambiguity of 33.3 ms in the exact duration of the
phoneme, since the perceived boundary between the two phonemes did not necessarily
occur  at  the  middle  of  the  splice  (due  to  possible  auditory  masking  effects  associated  with 
the  dissimilar  signals  on  either  side  of  the  splice). 
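
The 33.3 ms figure follows from the tape geometry, as the sketch below shows. It assumes that the splice angle is measured relative to the edge of the tape, so that a 45-degree cut across quarter-inch tape spans one-quarter inch along its length; the function name is illustrative.

```python
import math

def splice_duration_ms(tape_width_in, speed_ips, splice_angle_deg):
    """Duration of tape occupied by an angled splice.

    The cut spans tape_width / tan(angle) inches along the tape, and the
    tape moves at speed_ips inches per second.
    """
    splice_length_in = tape_width_in / math.tan(math.radians(splice_angle_deg))
    return 1000.0 * splice_length_in / speed_ips

# Quarter-inch tape at 7.5 inches per second with a 45-degree splice.
print(round(splice_duration_ms(0.25, 7.5, 45.0), 1))
```

By comparison, the one or two millisecond discrepancy in the present study is more than an order of magnitude smaller than this splice ambiguity.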

4.3.2  Test  Format 

Although  they  were  not  used  in  the  formal  listening  test,  the  three  standardized  tests 
listed in Section 4.1 were examined during the development of the test format. All three
of  the  standardized  tests  are  closed-response  tests.  There  are  several  reasons  for  this:  First, 
this  type  of  test  is  reliable,  and  enables  the  investigator  to  systematically  analyze  the  test 
results.  This  is  because  the  number  and  type  of  responses  are  limited  in  advance  to 
represent  words  that  contain  only  the  acoustic  features  of  interest.  Second,  the  tests  are  easy 
to  administer  and  score.  Since  the  list  of  possible  answers  is  predetermined,  it  is  possible 
to automate the testing and scoring processes. This greatly simplifies and accelerates the
testing  and  scoring  procedures,  especially  when  a  large  number  of  listeners  are  used. 

This test also used a closed-response answer set. Note, however, that it differed
slightly from a typical closed-response test in that not all of the possible answers that were
presented visually to the listeners were included in the set of words that were played.
However,  the  original  words  that  were  used  to  create  the  time-modified  test  tokens  were 
included  in  the  list  of  possible  answers. 

For  each  test  token,  the  listener  was  presented  visually  with  a  list  of  nine  words,  or 
choices.  For  consistency,  the  set  of  possible  answers  for  each  of  the  two  word  pairs 
("sue/zoo"  and  "said/zed")  contained  words  with  the  same  list  of  initial  consonants.  Each 
answer  differed  from  the  original,  unmodified  word  only  in  the  initial  phoneme.  The  set 
of possible answers for the words "sue" and "zoo" was: "boo," "doo," "lou," "ooh,"
"poo," "sue," "two," "thoo," and "zoo." Likewise, the set of possible answers for the words
"said" and "zed" was: "bed," "dead," "ed," "led," "ped," "said," "ted," "thed," and "zed."
Note that all of the words in a given list rhyme with one another.

The set of possible answers was critical to the test design. Incorrect conclusions
could  have  resulted  if  the  set  of  choices  did  not  account  for  all  probable  answers. 
Unfortunately,  this  factor  is  usually  only  discussed  in  the  literature  from  a 
testing-efficiency  viewpoint  (i.e.  the  more  choices  a  listener  has,  the  more  time  he/she  must 
be  given  to  respond).  Therefore,  considerable  time  was  spent  in  an  attempt  to  determine 
all  of  the  probable  answers.  The  two  lists  of  answers  were  developed  during  repeated 
listening  of  the  entire  set  of  270  time-modified  test  tokens  by  the  three  listeners  who 
participated  in  the  pilot  studies.  Each  list  contained,  in  word-initial  position:  two  unvoiced 
stops (/p/ and /t/), two voiced stops (/b/ and /d/), one liquid (/l/), two unvoiced fricatives
(/s/ and /θ/), one voiced fricative (/z/), and one "missing" initial fricative (to account for
the  possibility  that  the  listener  perceived  no  initial  consonant). 

The  test  was  administered  automatically.  The  test  procedure  was  as  follows:  The 
nine  choices  were  presented  visually  on  a  computer  CRT  display  approximately  one  second 
before  each  test  token  was  played.  A  row  of  push-buttons  was  displayed  on  the  CRT  below 
the  choices,  with  one  push-button  displayed  for  each  choice.  The  listener  was  told  to  select 
the  choice  that  most  closely  matched  the  test  token  that  was  heard.  This  was  done  by 
selecting  the  push-button  (with  a  mouse)  below  the  choice  that  best  matched  the  test  token. 

The  test  tokens  were  converted  from  a  sampled-data  waveform  to  an  analog  signal 
using  a  16-bit,  Ariel,  Pro-Port  Model  656  digital-to-analog  (D/A)  converter.  The  sampling 
frequency  was  10  kHz.  The  D/A  converter  was  connected  to  an  Ariel  DSP-32C  digital 
signal  processing  board  that  was  installed  in  a  Sun  Microsystems  IPX  workstation.  The 
test  tokens  were  played  over  Sony  MDR-CD888  headphones  at  a  comfortable  listening 
level  in  a  quiet  room. 

The  tokens  were  played  at  a  rate  of  approximately  one  token  every  4.5  seconds. 
In  order  to  prevent  the  listener  from  becoming  fatigued,  a  five  second  pause  was  inserted 
after  every  ten  tokens.  In  addition,  the  listener  was  given  a  three  minute  break  after  every 
90 tokens (approximately every nine minutes).

The listener was told that he/she must select one of the nine choices for each token
that was presented. The listener was also told that if he/she failed to select one of the nine
choices for a given token, it would not affect the outcome of the experiment, although it
was highly undesirable. This was done to prevent the listener from being distracted if a
choice  was  not  made  in  the  allotted  time.  Unbeknownst  to  the  listener,  if  a  choice  was  not 
made  for  a  specific  token,  the  token  was  saved  and  presented  to  the  listener  at  a  later  time 
during  the  same  test.  This  process  was  repeated  for  each  token  that  was  missed  until  a  single 
choice  was  made  for  every  test  token. 
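
The handling of missed tokens can be sketched with a simple queue. This is a hypothetical outline, not the test software itself: the text says only that a missed token was presented again "at a later time," so appending it to the end of the queue is an assumption, and get_response stands in for the playback and push-button interface.

```python
from collections import deque

def run_test(tokens, get_response):
    """Present every token until each one has exactly one response.

    get_response(token) returns the listener's choice, or None if no
    choice was made in the allotted time.
    """
    pending = deque(tokens)
    responses = {}
    while pending:
        token = pending.popleft()
        choice = get_response(token)
        if choice is None:
            # Missed token: save it and present it again later.
            pending.append(token)
        else:
            responses[token] = choice
    return responses
```

The loop ends only when every token has received a response, which mirrors the requirement that a single choice be recorded for each test token.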

A  copy  of  the  test  instructions  is  given  in  Appendix  B.  Before  the  test,  a  written 
copy  of  the  instructions  was  given  to  the  listener.  The  instructions  were  also  read  aloud 
by  the  test  administrator.  As  part  of  the  test  instructions,  each  of  the  18  possible  word 
choices  was  pronounced  by  the  test  administrator. 

The  listener  was  allowed  to  practice  taking  the  test  with  the  actual  set  of  test  tokens. 
The  practice  was  done  to  familiarize  the  listener  with  the  motor  skills  required  in  the  test 
and  to  allow  the  listener  to  become  familiar  with  the  list  of  possible  answers.  The  practice 
was  done  immediately  preceding  the  actual  test,  and  the  listener  was  told  to  take  as  much 
time  as  was  necessary  to  become  accustomed  to  the  testing  system.  In  general,  it  has  been 
shown that practice is desirable, and helps to decrease variability in test scores (ANSI, 1989;
IEEE Subjective Measurements Subcommittee, 1969; Voiers, 1983). All of the listeners
found  the  automated  testing  system  easy  to  use,  and  the  average  practice  time  was  less  than 
five  minutes. 

Note  that  the  listeners  were  not  informed  about  how  the  test  tokens  were  created, 
nor  about  what  the  listening  test  was  measuring  (i.e.  the  effects  of  consonant  duration  upon 
perception).  The  listeners  were  told  only  that  they  were  participating  in  a  test  of  the 
intelligibility  of  a  new  type  of  speech  synthesizer,  and  that  each  of  the  test  tokens  that  they 
heard  was  one  of  the  18  previously  discussed  words.  This  was  done  to  prevent  the  test 
responses from being biased towards the four original, unmodified words.

4.3.3  Listeners and Listening Environment

This section discusses the listeners that participated in the formal listening test. It
details  the  steps  that  were  followed  in  the  selection  of  the  type  and  number  of  listeners,  as 
well  as  the  screening  process  that  was  used. 

4.3.3.1  Type 

Table  4-3  gives  a  summary  of  formal  listening  tests  conducted  by  other  researchers, 
and  lists  the  number  and  type  of  listeners  that  were  used,  as  well  as  details  regarding  the 
listening  environment.  It  can  be  seen  from  this  table  that  when  statistics  concerning  the 
listeners  were  reported,  the  majority  of  the  studies  used  university  students  as  test  subjects. 
The  only  other  attribute  common  to  the  studies  was  that  the  listeners  were  all  reported  to 
have  normal  hearing. 

The  group  of  listeners  for  this  study  was  selected  from  both  faculty  and 
undergraduate  students  from  the  Department  of  Communication  Processes  and  Disorders 
at  the  University  of  Florida.  Each  listener  was  required  to  (1)  be  a  native  speaker  of 
American  English,  (2)  have  no  known  hearing  deficiency,  (3)  have  experience  with  the 
operation  of  a  computer  mouse,  and  (4)  be  willing  to  take  the  test  twice  (once  on  one  day, 
and  once  on  a  second  day).  Each  listener  was  paid  $15. 

4.3.3.2  Number 

It can be seen from Table 4-3 that in past studies, each researcher used a different
number of listeners. The smallest number of listeners used was 12, while the largest number
of listeners was 33. In this study, we tested 29 listeners. Note that two of the listeners were
subsequently  rejected  by  the  screening  process,  so  the  results  are  presented  for  a  total  of 
27  listeners  (four  male  and  23  female). 


Table 4-3. Summary of formal listening tests conducted by other researchers (number and type of listeners, listening environment). [Table contents not recoverable.]


4.3.3.3  Training 

Aside  from  the  short  practice  session  immediately  preceding  the  test,  the  listeners 
were  not  exposed  to  the  time-modified  words  before  the  test  was  conducted.  Although  the 
listeners  were  either  students  or  faculty  in  the  field  of  speech  science,  they  are  still  classified 
as  "untrained"  listeners,  since  they  had  no  significant  prior  exposure  to  time-modified 
speech.  This  agrees  with  the  definitions  of  untrained  listeners  found  in  the  various  studies 
described  in  Section  1.3. 

4.3.3.4  Screening 

Listener  intra-variability,  or  inconsistency,  is  a  source  of  testing  error,  especially 
for  untrained  listeners.  To  account  for  this,  a  simple  screening  process  was  used  to  measure 
inconsistencies  in  the  responses  of  the  individual  listeners.  The  procedure  was  as  follows: 
The  listening  test  was  given  twice  to  each  listener.  The  time  span  between  the  first  and 
second  time  that  the  test  was  given  ranged  from  one  to  seven  days.  Thus,  the  test  was  never 
given  more  than  once  per  day  for  any  given  listener.  In  addition,  the  order  of  presentation 
of  the  tokens  was  randomized  in  both  tests.  The  presentation  order  in  the  first  test  was  not 
the  same  as  the  presentation  order  in  the  second  test,  although  all  270  tokens  were  played 
in  both  of  the  tests. 

Each listener's responses for the two tests were then compared. For each token, if
the answers from the two tests agreed, a "match" was said to occur. A list of the percentage
of matches for each listener is given in Table 4-4. The table shows that the majority of the
listeners had consistency scores that were clustered in the range from 69.3% to 86.7%. The
mean consistency score for the 29 listeners was 74.5%, with a standard deviation of 8.6%.
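
The match computation described above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name, the token identifiers, and the four-token example are hypothetical, and it assumes that each session's answers are stored by token so that the randomized presentation order does not affect the comparison.

```python
def match_percentage(first_test, second_test):
    """Percentage of tokens given the same answer in both test sessions.

    Each argument maps a token ID to the word the listener selected.
    Keying by token makes the result independent of presentation order.
    """
    if first_test.keys() != second_test.keys():
        raise ValueError("both sessions must cover the same tokens")
    matches = sum(first_test[t] == second_test[t] for t in first_test)
    return 100.0 * matches / len(first_test)


# Hypothetical four-token example (the actual test used 270 tokens).
day1 = {"sue_beg_010": "boo", "sue_beg_020": "ooh",
        "sue_mid_010": "doo", "sue_end_140": "sue"}
day2 = {"sue_mid_010": "doo", "sue_end_140": "sue",
        "sue_beg_010": "ooh", "sue_beg_020": "ooh"}

print(match_percentage(day1, day2))  # 75.0 (three of four answers agree)
```

With 270 tokens per session, each additional match changes the score by about 0.37 percentage points, which is consistent with the one-decimal values listed in Table 4-4.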

Two listeners had scores that were below those of the main group of listeners. One
had a score of 39.6%, and the other had a score of 60.7%. Therefore, the results for these
two listeners were not used. The mean consistency score for the remaining 27 listeners was
76.3%, with a standard deviation of 4.7%.

Table 4-4. Percentage of matches between two tests for each listener.

Listener ID    Percent Matches    Pass    Fail
     1              74.8            X
     2              72.2            X
     3              69.3            X
     4              39.6                    X
     5              74.1            X
     6              73.3            X
     7              75.2            X
     8              74.1            X
     9              82.6            X
    10              77.0            X
    11              80.0            X
    12              82.6            X
    13              78.9            X
    14              83.0            X
    15              80.4            X
    16              74.4            X
    17              75.9            X
    18              71.5            X
    19              71.9            X
    20              83.0            X
    21              75.2            X
    22              80.7            X
    23              86.7            X
    24              76.3            X
    25              70.0            X
    26              73.3            X
    27              71.9            X
    28              60.7                    X
    29              71.9            X
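
The summary statistics quoted for the screening process can be checked against the entries of Table 4-4 with a short script. This is a sketch under stated assumptions: the scores are the rounded values printed in the table, and the sample (n - 1) standard deviation is assumed because the text does not say which estimator was used. With these inputs the two means are reproduced, and the standard deviations agree with the reported values to within about 0.1 percentage point.

```python
from statistics import mean, stdev

# Percent matches per listener ID, copied from Table 4-4.
scores = {
    1: 74.8, 2: 72.2, 3: 69.3, 4: 39.6, 5: 74.1, 6: 73.3, 7: 75.2,
    8: 74.1, 9: 82.6, 10: 77.0, 11: 80.0, 12: 82.6, 13: 78.9, 14: 83.0,
    15: 80.4, 16: 74.4, 17: 75.9, 18: 71.5, 19: 71.9, 20: 83.0, 21: 75.2,
    22: 80.7, 23: 86.7, 24: 76.3, 25: 70.0, 26: 73.3, 27: 71.9, 28: 60.7,
    29: 71.9,
}
rejected = {4, 28}  # listeners marked "Fail" in Table 4-4

all_values = list(scores.values())
kept_values = [v for k, v in scores.items() if k not in rejected]

# stdev() is the sample standard deviation (an assumption, see above).
mean_all, sd_all = mean(all_values), stdev(all_values)
mean_kept, sd_kept = mean(kept_values), stdev(kept_values)

print(f"29 listeners: mean {mean_all:.1f}%, sd {sd_all:.2f}%")
print(f"27 listeners: mean {mean_kept:.1f}%, sd {sd_kept:.2f}%")
```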

4.4  Results  of  the  Formal  Listening  Test 

This  section  describes  the  results  of  the  formal  listening  test.  Each  of  the  27 
listeners  took  the  test  twice.  Therefore,  each  of  the  270  time-modified  tokens  was  tested 
54  times. 

Figures  4-5  through  4-8  show  an  average  of  the  listeners'  responses  for  the 
time-modified versions of "sue," "zoo," "said," and "zed," respectively. The results are
also  presented  numerically  in  Appendix  C.  In  each  figure,  the  graphs  are  arranged  in  three 
columns.  Each  column  shows  the  test  results  for  tokens  that  were  created  by  preserving 
the  same  portion,  or  position,  of  the  initial  consonant.  For  example,  the  first  column  of 
Figure  4-5  shows  the  test  results  for  the  tokens  that  were  created  by  preserving  various 
lengths  of  the  beginning  portion  of  the  initial  consonant  in  the  word  "sue."  The  second 
column  of  Figure  4-5  shows  the  test  results  for  the  tokens  that  were  created  by  preserving 
various  lengths  of  the  middle  portion  of  the  initial  consonant  in  the  word  "sue."  The  third 
column  of  Figure  4-5  shows  the  test  results  for  the  tokens  that  were  created  by  preserving 
various  lengths  of  the  end  portion  of  the  initial  consonant  in  the  word  "sue." 

The  graphs  are  also  arranged  in  nine  rows.  Each  row  shows  the  average  percentage 
of  listeners  that  selected  the  word  choice  shown  in  the  row  heading  as  the  word  that  most 
closely  matched  the  test  token  that  they  heard. 

For  each  of  the  27  graphs  in  a  given  figure,  the  abscissa  depicts  the  duration  (in  ms) 
of  the  time-modified  initial  consonant.  The  ordinate  depicts  the  average  percentage  of 
listeners  that  selected  the  word  choice  listed  in  the  row  heading.  Since  the  listener  chose 
one  of  nine  possible  answers,  the  level  of  chance  is  11.1%. 
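
The arithmetic behind the chance level and the reported percentages can be sketched as follows. The constants come from the text (27 listeners, two sessions, nine word choices); the response counts 19, 27, and 47 are not stated in the source and are inferred here as the whole-number counts out of 54 that round to percentages quoted in this chapter (35.2%, 50.0%, and 87.0%).

```python
N_LISTENERS = 27   # listeners retained after screening
N_SESSIONS = 2     # each listener took the test twice
N_CHOICES = 9      # word choices offered for every token

# Each token was therefore judged 54 times.
responses_per_token = N_LISTENERS * N_SESSIONS

# A listener guessing at random picks any given word 1 time in 9.
chance_percent = 100.0 / N_CHOICES


def percent_of_responses(count, total=responses_per_token):
    """Convert a response count into a percentage, rounded to one decimal."""
    return round(100.0 * count / total, 1)


print(responses_per_token)               # 54
print(round(chance_percent, 1))          # 11.1
# Inferred counts (assumption): 19, 27 and 47 responses out of 54.
print(percent_of_responses(19))          # 35.2
print(percent_of_responses(27))          # 50.0
print(percent_of_responses(47))          # 87.0
```

Because each token has 54 responses, the plotted percentages can only change in steps of about 1.9 percentage points.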

As an example of how to interpret the results, consider the graphs in Figure 4-5
(the reader is reminded that this figure only shows the test results for the time-modified
variations of the word "sue"). The top row of graphs shows the average percentage of
listeners, as a function of the duration of the initial consonant, that perceived the various
test tokens to be the word "boo." In this figure, each graph shows the results for 25 test
tokens. This is because the test tokens were created by progressively increasing the
duration of the initial consonant from zero ms to the original, unmodified length in 10 ms
increments, and the length of the unmodified consonant in the word "sue" was
approximately 240 ms. Note that in the other figures, the number of tokens tested and
reported in each of the graphs is not necessarily 25. This is because the duration of the
unmodified consonant for each of the four test words differed (see Table 4-2).

Figure 4-5. Summary of average of responses for the word "sue." In each graph, the
abscissa displays duration in ms of the initial /s/, and the ordinate displays
percentage of responses. The row headings list the word that was perceived. The
column headings list the portion of the initial consonant that was preserved.

Figure 4-6. Summary of average of responses for the word "zoo." In each graph, the
abscissa displays duration in ms of the initial /z/, and the ordinate displays
percentage of responses. The row headings list the word that was perceived. The
column headings list the portion of the initial consonant that was preserved.

Figure 4-7. Summary of average of responses for the word "said." In each graph, the
abscissa displays duration in ms of the initial /s/, and the ordinate displays
percentage of responses. The row headings list the word that was perceived. The
column headings list the portion of the initial consonant that was preserved.

Figure 4-8. Summary of average of responses for the word "zed." In each graph, the
abscissa displays duration in ms of the initial /z/, and the ordinate displays
percentage of responses. The row headings list the word that was perceived. The
column headings list the portion of the initial consonant that was preserved.

It  can  be  seen  from  the  left  graph  in  the  top  row  of  Figure  4-5  that  the  word  "boo" 
was  perceived  by  50.0%  of  the  listeners  when  the  beginning  10  ms  of  the  /s/  in  the  word 
"sue"  was  preserved.  When  zero  ms  of  the  beginning  portion  of  the  /s/  was  preserved  (i.e. 
the  consonant  was  completely  eliminated),  35.2%  of  the  listeners  heard  the  word  "boo." 
When  20  ms  of  the  beginning  portion  of  the  /s/  was  preserved,  35.2%  of  the  listeners  heard 
the  word  "boo."  When  60  ms  or  more  of  the  beginning  portion  of  the  /s/  was  preserved, 
none  of  the  listeners  heard  the  word  "boo."  Thus,  the  word  "boo"  was  perceived  when  a 
relatively  short  portion  of  the  beginning  of  the  initial  consonant  was  preserved. 

Examination of the middle graph in the top row of Figure 4-5 shows that for the
test tokens that preserved the middle 10 ms of the /s/ in the word "sue," 87.0% of the
listeners perceived the word "doo." This is contrasted by the result that 50.0% of the
listeners heard the word "boo" when the beginning 10 ms of the /s/ was preserved. While
the majority of listeners heard a voiced stop in both of these examples, the identity of the
stop was different. This result supports the hypothesis that the position of the portion of
the phoneme that is preserved has a significant role in perception.

In general, the words that were perceived by the listeners varied, and depended
upon the original word, the time modification mapping method (which determined
position), and the duration of the time-modified consonant. Due to these variations in
perception, the following sections describe the results according to the original word that
was used to create the time-modified tokens.

The  results  are  described  in  terms  of  the  two  sets  of  possible  answers  that  are 
described  in  Section  4.3.2.  Each  answer  set  contained  nine  words  that  differed  only  in  the 
initial  consonant.  Therefore,  the  results  are  often  presented  in  terms  of  the  initial  consonant 
that  was  perceived,  instead  of  the  word  that  was  perceived,  since  the  initial  consonant  was 
the  only  feature  that  differed  among  the  words  in  a  given  set  of  answers. 

4.4.1 Perception of the Time-Modified Variations of the Word "Sue"

Figure 4-5 shows the test results for the 75 time-modified versions of the word
"sue." The results for the 25 tokens that preserved the beginning portions of the /s/ are
shown in the first column. When the /s/ was completely removed, 55.6% of the listeners
heard no initial consonant (the word "ooh"), and 35.2% heard /b/. The listeners perceived
the consonant most often as /b/ for a duration of 10 ms, as missing (the word "ooh") for
durations in the range from 20 to 40 ms, as /d/ for a duration of 50 ms, as /z/ for durations
in the range from 60 to 100 ms, and as /s/ for durations of 110 ms or greater.

The results for the 25 tokens that preserved the middle portions of the /s/ are shown
in the second column in Figure 4-5. When the /s/ was completely removed, 63.0% of the
listeners perceived the initial consonant as missing, and 27.8% perceived /b/. The listeners
perceived the consonant most often as /d/ for durations in the range from 10 to 40 ms, as
/t/ for durations in the range from 50 to 90 ms, and as /s/ for durations of 100 ms or greater.

The results for the 25 tokens that preserved the end portions of the /s/ are shown
in the third column in Figure 4-5. When the /s/ was completely removed, 57.4% of the
listeners perceived no initial consonant (the word "ooh"), and 31.5% perceived /b/. The
listeners perceived the consonant most often as /b/ for a duration of 10 ms, as either /b/ or
/d/ (the scores were tied) for a duration of 20 ms, as /d/ for durations of 30 and 40 ms, as
/t/ for durations in the range from 50 to 130 ms, and as /s/ for durations of 140 ms or greater.

There was considerable disagreement among the listeners, as shown by the fact that
the largest single group of listeners who were in agreement often comprised less than 75%
of the total listening group. In the first column, at very short durations the disagreement
was between no initial consonant and /b/ or /d/. There was also disagreement between /s/
and /z/ for durations in the range from 80 to 130 ms. In the second column, the
disagreement was between /t/ and /d/ for durations in the range from 30 to 60 ms, and
between /t/ and /z/ for durations in the range from 60 to 90 ms. In the third column, the
disagreement was between no initial consonant and /b/ at very short durations, and between
/t/ and /d/ for durations in the range from 30 to 60 ms. There was also disagreement between
/s/ and /t/ for durations in the range from 110 to 150 ms.

4.4.2  Perception  of  the  Time-Modified  Variations  of  the  Word  "Zoo" 

Figure 4-6 shows the test results for the 75 time-modified versions of the word
"zoo." The results for the 25 tokens that preserved the beginning portions of the /z/ are
shown in the first column. When the /z/ was completely removed, 66.7% of the listeners
heard no initial consonant (the word "ooh"), and 13.0% heard /l/. The listeners perceived
the consonant most often as missing for durations in the range from 10 to 30 ms, as /l/ for
durations in the range from 40 to 160 ms, and as /z/ for durations of 170 ms or greater.

The results for the 25 tokens that preserved the middle portions of the /z/ are shown
in the second column of Figure 4-6. When the /z/ was completely removed, 70.4% of the
listeners heard no initial consonant (the word "ooh"), and 13.0% heard /l/. The listeners
perceived the consonant most often as /l/ for durations in the range from 10 to 120 ms,
although /z/ was actually perceived most often at 50, 80, and 110 ms. The consonant /z/
was perceived for durations of 130 ms or greater.

The results for the 25 tokens that preserved the end portions of the /z/ are shown
in the third column of Figure 4-6. When the /z/ was completely removed, 63.0% of the
listeners heard no initial consonant (the word "ooh"), and 14.8% heard /l/. The listeners
perceived the consonant most often as missing for durations in the range from 10 to 40 ms,
as /d/ for a duration of 50 ms, as /l/ for a duration of 60 ms, and as /z/ for durations of
70 ms or greater.

Note that in all three of the columns, there was considerable disagreement among
the listeners for almost all durations. This is shown by the fact that except for the answer
"zoo," the average perception for any one word seldom rose above 75%. In the first
column, the disagreement occurred mainly between no initial consonant and /l/ for
durations in the range from zero to 60 ms, and between /l/ and /z/ for durations in the range
from 130 to 240 ms. In the second column, the disagreement occurred between no initial
consonant and /l/ or /d/ for durations in the range from zero to 30 ms, and between /l/ and
/z/ for durations in the range from 50 to 190 ms. In the third column, the disagreement
occurred between no initial consonant and /l/ for durations in the range from zero to 40 ms,
and between /l/ and /z/ for durations in the range from 50 to 190 ms.

4.4.3  Perception  of  the  Time-Modified  Variations  of  the  Word  "Said" 

Figure 4-7 shows the test results for the 63 time-modified versions of the word
"said." The results for the 21 tokens that preserved the beginning portions of the /s/ are
shown in the first column. When the /s/ was completely removed, 51.9% of the listeners
heard /d/, and 29.6% heard no initial consonant (the word "ed"). The listeners perceived
the consonant most often as /d/ for durations in the range from 10 to 50 ms (although the
scores were tied for /d/ and /z/ at 40 ms), and as /s/ for durations of 60 ms or greater.

The results for the 21 tokens that preserved the middle portions of the /s/ are shown
in the second column of Figure 4-7. When the /s/ was completely removed, 46.3% of the
listeners heard /d/, and 25.9% heard no initial consonant. The listeners perceived the
consonant most often as /d/ for durations in the range from 10 to 30 ms, as /z/ for a duration
of 40 ms, and as /s/ for durations of 50 ms or greater.

The results for the 21 tokens that preserved the end portions of the /s/ are shown
in the third column of Figure 4-7. When the /s/ was completely removed, 44.4% of the
listeners heard /d/ as the initial consonant, and 40.7% heard no initial consonant. The
listeners perceived the consonant most often as /d/ for durations in the range from 10 to
30 ms, as /t/ for durations in the range from 40 to 60 ms, and as /s/ for durations of 70 ms
or greater.

There was some disagreement among the listeners, especially for durations that were
less than 90 ms. In the first column, the disagreement was predominantly between no initial
consonant and /d/ in the range from zero to 30 ms. There was also some disagreement
between /z/ and /d/ in the range from 30 to 70 ms. In the second column, the disagreement
was between /d/ and /z/ in the range from 30 to 60 ms. There was also disagreement
between /s/ and /z/ in the range from 40 to 90 ms. In the third column, there was
disagreement between no initial consonant and /d/ in the range from zero to 20 ms, and
between /s/, /t/, /d/, and /z/ in the range from 30 to 90 ms.

4.4.4  Perception  of  the  Time-Modified  Variations  of  the  Word  "Zed" 

Figure 4-8 shows the test results for the 57 time-modified versions of the word
"zed." The results for the 19 tokens that preserved the beginning portions of the /z/ are
shown in the first column. When the /z/ was completely removed, 42.6% of the listeners
heard /d/ as the initial consonant, and 24.1% heard no initial consonant (the word "ed").
The listeners perceived the consonant most often as /d/ for durations in the range from 10
to 50 ms, and as /z/ for durations of 60 ms or greater.

The results for the 19 tokens that preserved the middle portions of the /z/ are shown
in the second column in Figure 4-8. When the /z/ was completely removed, 38.9% of the
listeners heard /d/, and 25.9% heard no initial consonant. The listeners perceived the
consonant most often as /d/ for durations in the range from 10 to 30 ms, and as /z/ for
durations of 40 ms or greater.

The results for the 19 tokens that preserved the end portions of the /z/ are shown
in the third column in Figure 4-8. When the /z/ was completely removed, 44.4% of the
listeners heard /d/ as the initial consonant, and 24.1% heard no initial consonant. The
listeners perceived the consonant most often as IB I for a duration of 10 ms, as /z/ for a
duration of 20 ms, as /d/ for a duration of 30 ms, and as /z/ for durations of 40 ms or greater.

There were only a few areas of disagreement among the listeners, and these were
primarily at the shorter consonant durations. In the first column, the disagreement was
between no initial consonant, /d/, and /l/ for durations in the range from zero to 30 ms. There
was also disagreement between /d/ and /z/ for durations in the range from 40 to 110 ms.
In the second column, there was little disagreement, except between no initial consonant,
/l/, /d/, and /z/ at a duration of zero ms. In the third column, there was also disagreement
between no initial consonant, /l/, /d/, and /z/ at a duration of zero ms. There was also
disagreement between /d/ and /z/ for durations in the range from 10 to 50 ms.

4.4.5  Summary  of  Answers  Selected  Most  Often 

Figure  4-9  shows,  as  a  function  of  the  consonant  duration,  the  answers  that  were 
selected  most  often  by  the  27  listeners  for  the  time-modified  tokens  that  were  created  from 
the  words  "sue"  and  "said."  Similarly,  Figure  4-10  shows  the  answers  that  were  selected 
most  often  by  the  27  listeners  for  the  time-modified  tokens  that  were  created  from  the 
words  "zoo"  and  "zed."  In  each  figure,  the  three  horizontal  bars  depict  the  results  for  the 
tokens  that  were  modified  by  preserving  the  beginning,  middle,  and  end  portions  of  the 
initial  consonant,  respectively.  The  scale  at  the  bottom  of  each  set  of  bar  graphs  shows  the 
duration  of  the  initial  consonant. 

As an example of how to interpret the figures, consider the top horizontal bar in
Figure 4-9a. This shows the answers that were selected most often for the 25 tokens that
were created by preserving various portions of the beginning of the /s/ in the word "sue."
For a consonant duration of zero ms, no initial consonant (denoted by an asterisk) was
perceived most often. Continuing from left to right, the answers that were selected most
often were /b/ for a duration of 10 ms, no initial consonant for durations in the range from
20 to 40 ms, /d/ at a duration of 50 ms, /z/ for durations in the range from 60 to 100 ms,
and /s/ for durations of 110 ms or greater.

Figure 4-9. Summary of the initial consonants that were perceived most often in the
formal listening test. The abscissa denotes the duration of the initial consonant in
ms, and the row headings denote the portion of the consonant that was preserved.

a) Results for tokens created from the word "sue;"

b) Results for tokens created from the word "said."

Figure 4-10. Summary of the initial consonants that were perceived most often in the
formal listening test. The abscissa denotes the duration of the initial consonant in
ms, and the row headings denote the portion of the consonant that was preserved.

a) Results for tokens created from the word "zoo;"

b) Results for tokens created from the word "zed."

The reader is reminded that although these figures depict the choices that were
selected most often for each token, the figures do not show the percentage of the listeners
that selected the "most popular" choice. This is important because in many instances, and
especially at the shorter durations, the listening group was divided in their answers. The
actual division of listeners is depicted in Figures 4-5 through 4-8.
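
The distinction between the most frequent answer and the share of listeners who chose it can be illustrated with a small sketch. The function name is hypothetical. The counts are inferred, not taken from the source: 30 and 19 responses out of 54 are the whole numbers that round to the 55.6% ("ooh") and 35.2% ("boo") reported for the "sue" token with the /s/ completely removed, and the remaining five responses, whose distribution is not given in this section, are lumped under "other".

```python
def plurality(counts):
    """Return the most frequently selected answer(s) and their share.

    counts maps each answer to the number of responses it received.
    All answers tied for first place are returned, since ties did
    occur in the listening test.
    """
    total = sum(counts.values())
    top = max(counts.values())
    winners = sorted(word for word, n in counts.items() if n == top)
    return winners, 100.0 * top / total


# Inferred counts for "sue" with zero ms of /s/ preserved (assumption).
counts = {"ooh": 30, "boo": 19, "other": 5}

winners, share = plurality(counts)
print(winners, round(share, 1))  # ['ooh'] 55.6
```

A summary such as Figures 4-9 and 4-10 reports only the first element of this result; here the "most popular" answer was chosen in only slightly more than half of the responses.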

4.5  Summary 

This chapter described both the pilot studies and the formal listening tests that were
performed. The pilot studies used single-syllable CV and CVC words from the DRT
spoken by one female and two male speakers. Single-syllable words were chosen because
they contained a relatively small number of phonemic segments, and were therefore easier
to modify and test in a controlled manner than were multi-syllable words. Single-syllable
words were also chosen in order to eliminate contextual information.

The pilot studies examined the effects of removing various portions of the initial
phoneme. Particular attention was paid to the quality of the resulting time-modified
speech. In all instances, the listeners reported that the synthesized speech was
indistinguishable from the time-modified speech, insofar as the quality was concerned.

The pilot studies were grouped according to the speech segment categories that
were investigated. Tests that removed various portions of the initial nasals in CV and CVC
words showed that a relatively large portion of the nasal could be removed before
perception was significantly affected. Conversely, tests that removed various portions of
the initial stops in CV and CVC words showed that correct perception of the stop decreased
relatively quickly as the duration was reduced. In addition, it was observed that perception
of the time-modified stops was dependent upon the relative position of the portion of the
stop that was removed. For example, perception of the consonant in a CV word in which
the initial 50 ms of the stop was preserved often differed from perception of the consonant
in a word in which the final 50 ms of the stop was preserved. Tests that removed various
portions of the initial fricatives in CV and CVC words showed that as the duration was
decreased, voiced fricatives were more resistant to changes in perception than were
unvoiced fricatives.

The formal listening test was conducted to measure the effects of removing
segments of various durations from three different portions in the initial fricative in the
words "sue," "zoo," "said," and "zed." Multiple tokens were created from each word by
successively incrementing the time-modified duration of the initial consonant in 10 ms
intervals over the range from 0% to 100% of the unmodified duration. For example, for
the word "sue," test words, or tokens, were created with an initial consonant duration from
the set {0 ms, 10 ms, 20 ms, 30 ms, ..., 230 ms, 240 ms}. In addition, for each duration,
three tokens were created: The first preserved the beginning portion of the consonant, the
second preserved the middle portion of the consonant, and the third preserved the end
portion of the consonant. A total of 270 tokens was created and tested. The vowel and the
following consonant (for the CVC words) were not modified.
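
The construction of the token set can be sketched as follows. Several details are assumptions made for illustration: only the 240 ms duration for "sue" is stated in the text, while the values used for "zoo," "said," and "zed" are inferred from the reported token counts (25, 21, and 19 durations per position), and the "middle" portion is assumed to be centered within the original consonant. The function names are hypothetical.

```python
POSITIONS = ("beginning", "middle", "end")

# Unmodified consonant durations in ms. Only "sue" is given in the text;
# the other three are inferred from the token counts (assumption).
ORIGINAL_MS = {"sue": 240, "zoo": 240, "said": 200, "zed": 180}


def token_durations(original_ms, step_ms=10):
    """Durations from 0 ms up to the unmodified length in 10 ms steps."""
    return list(range(0, original_ms + 1, step_ms))


def preserved_interval(original_ms, keep_ms, position):
    """Start and end (ms) of the portion of the consonant that is kept."""
    if position == "beginning":
        start = 0.0
    elif position == "end":
        start = float(original_ms - keep_ms)
    elif position == "middle":
        # Centering the kept portion is an assumption of this sketch.
        start = (original_ms - keep_ms) / 2.0
    else:
        raise ValueError(f"unknown position: {position}")
    return start, start + keep_ms


tokens = [
    (word, position, keep_ms)
    for word, original_ms in ORIGINAL_MS.items()
    for position in POSITIONS
    for keep_ms in token_durations(original_ms)
]

print(len(token_durations(240)))            # 25 durations for "sue"
print(len(tokens))                          # 270 tokens in total
print(preserved_interval(240, 50, "end"))   # (190.0, 240.0)
```

Under these assumptions the per-word totals are 75, 75, 63, and 57 tokens, which matches the counts reported in Section 4.4 and sums to the 270 tokens used in the test.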

The  test  procedure  was  automated  and  administered  using  a  Sun  UNIX 
workstation.  The  tokens  were  presented  via  headphones  at  a  rate  of  approximately  one 
token  every  4.5  seconds.  A  list  of  nine  word  choices  that  differed  only  in  the  initial 
consonant  was  displayed  on  the  CRT  display,  and  a  push-button  was  displayed  below  each 
of  the  nine  words.  The  listener  used  the  computer  mouse  to  select  the  word  choice  that  most 
closely  matched  each  test  token  that  was  heard.  The  test  took  approximately  25  minutes 
to  complete. 

A total of 29 listeners took the test. Each listener took the test twice, once on one
day and once on a second day. This was done to screen the listeners for inconsistencies in
their responses. Two of the listeners were discarded by this screening process. Therefore,
the results were presented for a total of 27 listeners (four male and 23 female).

The results were presented for each of the four words in terms of the percentage of
listeners who heard a particular answer as a function of both the duration of the
time-modified token, and the modification method (i.e. which portion of the consonant was
preserved). In general, the results showed that the specific initial consonant that was
perceived depended upon the duration as well as the position of the time-modified
consonant.

CHAPTER 5
DISCUSSION OF THE FORMAL LISTENING TEST RESULTS


This chapter discusses the results of the formal listening test. The goal of the
chapter is to determine and explain the acoustic features present in the time-modified test
tokens that caused the perception of various initial consonants. A complete description of
the formal listening test is given in Chapter 4. The results are presented graphically in
Figures 4-5 through 4-10, and in numerical form in Appendix C.

This  chapter  is  organized  as  follows:  The  first  section  discusses  the  perception  of 
the  time-modified  /s/  in  the  words  "sue"  and  "said,"  and  the  second  section  discusses  the 
perception of the time-modified /z/ in the words "zoo" and "zed." The results are discussed
separately  for  each  of  the  words,  and  then  compared  for  each  word  pair.  The  discussions 
for  each  word  are  also  divided  according  to  the  portion,  or  position,  of  the  initial  consonant 
that  was  preserved. 

5.1 Perception of the Time-Modified /s/

This  section  discusses  the  results  for  the  tokens  that  were  created  by  modifying  both 
the  duration  and  the  position  of  the  portion  of  the  /s/  that  was  preserved  in  the  CV  word 
"sue"  and  the  CVC  word  "said."  The  results  for  the  two  words  are  discussed  separately, 
and  then  compared. 

5.1.1   The  Word  "Sue" 

This section discusses the results for the 75 tokens that were created from the word
"sue." In general, the results showed that both duration and position affected the perception
of the time-modified /s/. The results are presented separately for the three sets of tokens
that  were  created  by  preserving  the  beginning,  middle,  and  end  of  the  original  consonant. 
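
As an illustration of what preserving the beginning, middle, or end of a consonant means in terms of sample indices, the following Python sketch selects the span of samples to keep. It is a hypothetical helper and not the procedure used to create the test tokens (that procedure is described in Chapter 4); the function name, its arguments, and the centering rule used for the "middle" case are assumptions.

```python
def preserved_span(start, end, keep, position):
    """Return (first, last_exclusive) sample indices of the kept part of a consonant.

    start, end : sample indices bounding the consonant (end is exclusive)
    keep       : number of consonant samples to preserve
    position   : "beginning", "middle", or "end"
    """
    length = end - start
    if not 0 <= keep <= length:
        raise ValueError("keep must lie between 0 and the consonant length")
    if position == "beginning":
        first = start
    elif position == "middle":
        # Assumption: the kept span is centered within the consonant.
        first = start + (length - keep) // 2
    elif position == "end":
        first = end - keep
    else:
        raise ValueError("unknown position: " + position)
    return first, first + keep
```

For a consonant spanning samples 1000 to 3000, keeping 500 samples at the end would select samples 2500 to 3000, so the kept portion directly abuts the following vowel.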

5.1.1.1  Tokens  that  preserved  the  beginning  of  the  /s/ 

This  section  discusses  the  results  for  the  25  tokens  that  preserved  the  beginning 
portions  of  the  /s/  in  the  word  "sue."  For  consonant  durations  less  than  50  ms,  listeners 
reported  the  initial  phoneme  as  either  missing  or  as  the  voiced  stop  /b/.  Although  the 
missing fricative was heard most often, /b/ was heard almost as many times, especially for
durations  less  than  30  ms.  The  result  that  a  stop  was  heard  suggests  that  many  of  the 
listeners  perceived  the  short  duration  of  the  /s/  as  turbulent  noise  associated  with  the  release 
of  a  stop.  The  finding  that  a  voiced  stop  was  heard  in  this  range  of  durations  is  consistent 
with  the  results  of  Lisker  and  Abramson  (1964).  They  showed  that  a  stop  was  perceived 
as voiced if the time between the release and the voicing of the following vowel (known
as the voice onset time, or VOT) was less than 25 ms for the labial stops (/p/ and /b/),
or  35  ms  for  the  alveolar  stops  (/t/  and  /d/). 
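
The voicing rule summarized above can be written as a small decision function. The Python sketch below encodes only the two boundary values quoted in this section (25 ms for labial stops and 35 ms for alveolar stops); the function and dictionary names are illustrative, and treating the boundary as a hard cutoff is a simplifying assumption.

```python
# Voicing boundaries quoted in the text, in ms, keyed by place of articulation.
VOT_BOUNDARY_MS = {"labial": 25.0, "alveolar": 35.0}


def stop_voicing(vot_ms, place):
    """Classify a stop as "voiced" or "unvoiced" from its VOT in ms."""
    boundary = VOT_BOUNDARY_MS[place]
    # A VOT below the boundary is taken as voiced; otherwise unvoiced.
    return "voiced" if vot_ms < boundary else "unvoiced"
```

Under this rule, a 30 ms interval between burst and voicing would be labeled unvoiced for a labial stop but voiced for an alveolar stop.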

The  specific  stop  /b/  was  heard  due  to  the  listeners'  perception  of  the  place  of 
articulation.  Place  of  articulation  for  stops  is  determined  by  two  separate  cues:  the  center 
frequency of the burst preceding the vowel, and the movement of the second formant
frequency  at  the  consonant-vowel  boundary  (Delattre  et  al.,  1955;  Halle  et  al.,  1957; 
Kewley-Port,  1982).  Intuitively,  the  stop  that  was  perceived  should  have  had 
approximately  the  same  place  of  articulation  as  the  original,  unmodified  consonant,  since 
the  former  was  derived  from  the  latter.  This,  however,  was  not  the  case,  since  /b/  is  a  labial, 
and  the  original  consonant  /s/  is  an  alveolar. 

The  explanation  for  this  shift  in  the  perceived  place  of  articulation  is  based  upon 
the frequency content of the portion of the consonant that was preserved. Figure 5-1 shows
both the time-domain waveform and the corresponding spectrogram for a portion of the
original, unmodified word "sue." The portion that is shown is the consonant /s/ and the
transition to the following vowel /u/. The beginning and end points of the consonant, as
determined by the automatic segmentation and labeling programs described in Chapter 2,
are marked with arrows at the top of each graph.

Figure 5-1. Original word "sue." Only the consonant and a portion of the vowel are
shown. The beginning and end of the consonant (based on the automatic
segmentation and labeling results) are shown by arrows above each graph.
a) Time-domain waveform;
b) Spectrogram.

Note that the spectrogram in this figure (and in all of the following figures) was
created  using  a  Hamming  window  with  a  duration  of  8.0  ms.  As  a  result,  the  frequency 
resolution  was  125  Hz.  The  analysis  window  was  advanced  2.0  ms  after  each  short-term 
Fourier  transform,  which  corresponded  to  a  window  overlap  of  75%.  These  analysis 
parameters,  as  well  as  the  fact  that  a  relatively  short  portion  of  the  speech  signal  (350  ms) 
was  depicted,  accounted  for  the  "coarseness"  of  the  spectrograms. 
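
The analysis parameters quoted above can be checked with a short calculation. The Python sketch below assumes that the frequency resolution is taken as the reciprocal of the window duration, which reproduces the 125 Hz figure; the function name is illustrative, and the broadening of the main lobe by the Hamming window is ignored.

```python
def stft_parameters(window_ms, hop_ms):
    """Return (frequency resolution in Hz, fractional overlap) for an STFT.

    window_ms : duration of the analysis window in ms
    hop_ms    : advance of the window between successive transforms in ms
    """
    # Assumption: resolution is the reciprocal of the window duration.
    resolution_hz = 1000.0 / window_ms
    # Overlap is the fraction of the window shared by successive frames.
    overlap = 1.0 - hop_ms / window_ms
    return resolution_hz, overlap


print(stft_parameters(8.0, 2.0))  # (125.0, 0.75)
```

A longer window would give finer frequency resolution at the cost of time resolution, which matters here because the consonant portions of interest are only tens of milliseconds long.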

It can be seen from the spectrogram in Figure 5-1b that compared to the entire
consonant,  the  beginning  20  to  30  ms  of  the  /s/  had  relatively  little  energy  above  3  kHz. 
Since  one  of  the  cues  for  the  place  of  articulation  is  the  center  frequency  of  the  noise  burst, 
the  labial  /b/  was  perceived  since  the  energy  in  the  first  30  ms  of  the  /s/  was  centered  at 
approximately  1.5  kHz  (Cooper  et  al.,  1952;  Halle  et  al.,  1957;  Liberman  et  al.,  1952; 
Stevens and Blumstein, 1978). In order for an alveolar /d/ to have been perceived, the noise
would  have  had  to  have  been  centered  at  a  much  higher  frequency. 

At  a  duration  of  50  ms,  there  was  a  peak  in  the  number  of  listeners  who  reported 
hearing the unvoiced fricative /θ/. A significant percentage of the listeners also heard no
initial  consonant.  This  indicates  that  for  this  duration,  the  time-modified  /s/  was  not 
perceived  as  turbulent  noise  associated  with  a  stop.  This  is  most  likely  due  to  the  fact  that 
the  energy  contour  as  a  function  of  time  of  the  /s/  was  increasing,  instead  of  decreasing  as 
it would during a stop consonant release (Stevens, 1980). Also, since the overall energy
of  the  /s/  was  relatively  small,  the  consonant  was  perceived  as  either  missing  altogether, 
or  as  a  weak,  unvoiced  fricative  (Behrens  and  Blumstein,  1988). 

The particular fricative /θ/ was heard due to the perceived place of articulation. The
fricative /θ/ is a linguadental fricative, and is produced by creating a constriction with the
tip of the tongue against the upper incisors. The original consonant /s/ is alveolar, and is
produced by creating a constriction slightly behind the incisors, along the alveolar ridge
(Heinz and Stevens, 1961). Thus, the place of articulation is similar for the two.

For  durations  in  the  range  from  60  to  100  ms,  the  time-modified  /s/  was  perceived 
by  a  majority  of  the  listeners  as  the  voiced  fricative  /z/.  As  discussed  earlier,  the  initial 
consonant  was  perceived  as  a  fricative  instead  of  a  stop  due  to  the  energy  contour  that 
increased over time. In terms of place of articulation, /z/ was perceived over the range from
60  to  100  ms  since  both  /z/  and  /s/  are  alveolar. 

For durations of 110 ms or greater, the listeners perceived the initial consonant as
/s/. These results agree with those of Cole and Cooper (1975), who showed that the
perceived  voicing  of  a  word-initial  fricative  was  dependent  upon  the  duration  of  the 
fricative.  They  also  showed  that  the  low  frequency  energy  of  voicing  typically  associated 
with  a  voiced  fricative  "does  not  provide  a  necessary  cue  for  the  voiced-voiceless 
distinction  in  fricatives."  Thus,  the  results  of  this  study  support  their  finding  in  that  it  is 
possible  to  perceive  a  voiced  fricative  without  the  presence  of  voicing. 

The  minimum  consonant  duration  required  for  consistent  perception  of  /s/  is 
denoted  in  this  study  as  the  "threshold"  for  perception  of  the  original  consonant.  This 
means  that  for  all  consonant  durations  equal  to  or  greater  than  the  threshold,  the  listeners 
heard  /s/  most  often.  For  this  set  of  tokens,  the  threshold  was  110  ms.  In  terms  of  absolute 
durations,  this  value  agrees  closely  with  the  value  of  100  ms  reported  in  similar 
experiments  by  Cole  and  Cooper  (1975)  as  the  duration  that  marked  the  boundary  between 
/z/ and /s/. However, it is much larger than that observed in a similar study by Jongman
(1989), who reported that approximately 50 ms of frication was required to perceive /s/
with  "reasonable  accuracy." 
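
The threshold definition given above can be sketched in Python as follows. The function name, the response labels, and the example data are hypothetical (the example is not taken from the listening test results); the sketch assumes that the most frequently chosen response has already been determined for each consonant duration.

```python
def perception_threshold(most_heard, target="s"):
    """Return the minimum duration for consistent perception of `target`.

    most_heard maps a consonant duration (ms) to the response chosen most
    often at that duration. The threshold is the smallest duration such
    that `target` is the most frequent response at that duration and at
    every longer duration. Returns None if no such duration exists.
    """
    threshold = None
    # Walk from the longest duration downward until the target is lost.
    for duration in sorted(most_heard, reverse=True):
        if most_heard[duration] != target:
            break
        threshold = duration
    return threshold


# Hypothetical responses, for illustration only.
example = {50: "th", 60: "z", 100: "z", 110: "s", 120: "s", 130: "s"}
print(perception_threshold(example))  # 110
```

Scanning from the longest duration downward means that an isolated /s/ majority at a short duration, followed by a different response at a longer one, does not lower the threshold.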

5.1.1.2  Tokens  that  preserved  the  middle  of  the  /s/ 

This  section  discusses  the  results  for  the  25  tokens  that  preserved  the  middle 
portions of the /s/ in the word "sue." When compared to the results of the previous section,
the results were similar for consonant durations greater than about 100 ms, but varied for
shorter  durations.  This  result  supports  the  hypothesis  that  the  position  of  the  portion  of  the 
phoneme  that  is  removed  affects  perception. 

For  durations  in  the  range  from  10  to  40  ms,  a  large  majority  of  the  listeners  heard 
the  voiced  stop  /d/  as  the  initial  consonant.  This  differed  in  several  ways  from  the  results 
for  the  tokens  that  preserved  the  beginning  portions  of  the  /s/:  First,  few  of  the  listeners 
perceived  the  initial  consonant  as  missing  for  durations  of  10  ms  or  greater.  Second,  while 
both  of  the  modification  methods  resulted  in  perception  of  a  voiced  stop,  the  perception 
was  much  stronger  for  the  tokens  that  preserved  the  middle  portion  of  the  original 
consonant  than  for  the  tokens  that  preserved  the  beginning  portion.  The  perception  of  a 
stop also occurred at a slightly higher and longer range of durations for the tokens that
preserved  the  middle  portions  of  the  /s/.  In  addition,  the  identity  of  the  voiced  stop  differed 
between  the  two  sets  of  tokens. 

The  reasons  for  these  differences  are  as  follows:  Each  of  the  tokens  in  this  set  had 
a  relatively  abrupt  rise  in  the  energy  contour  at  the  onset  of  the  time-modified  /s/.  An 
example of this is shown in Figure 4-3c. This sudden onset created a perceptual cue for
the  presence  of  a  stop,  since  it  closely  approximated  the  sudden  release  of  energy,  or 
"burst,"  normally  exhibited  by  stops.  Since  the  duration  of  the  consonant  was  40  ms  or  less, 
the stop was perceived as voiced (Liberman et al., 1958; Lisker and Abramson, 1964).

In  terms  of  the  place  of  articulation,  perception  of  the  particular  voiced  stop  /d/  was 
related  to  the  center  frequency  of  the  noise  burst,  as  described  in  detail  in  the  previous 
section.  Since  the  turbulent  noise  at  the  middle  of  the  unmodified  /s/  was  concentrated 
primarily above 3 kHz, the alveolar /d/ was perceived instead of the labial /b/ or the velar
/g/.

For durations between 50 and 90 ms, almost 50% of the listeners reported hearing
the unvoiced stop /t/. As discussed previously, perception of this stop is attributed, in part,
to the sudden onset of the time-modified /s/. The stop was perceived as unvoiced due to
the duration of the frication, which was greater than 35 ms. Also, since the majority of the
consonant's  energy  was  centered  in  the  frequency  range  above  3  kHz,  the  alveolar  /t/  was 
perceived. 

Note, however, that for this same range of durations, approximately 25% of the
listeners  heard  the  voiced  fricative  /z/  instead  of  the  /t/.  This  result  is  similar  to  that 
observed  for  the  tokens  that  preserved  the  beginning  portions  of  the  /s/.  This  indicates  that 
for  approximately  one  in  every  four  of  the  listeners,  the  energy  contour  of  the 
time-modified  consonant  did  not  signal  the  release  of  a  stop. 

The  threshold  for  consistent  perception  of /s/  was  observed  to  be  100  ms.  This  value 
was only 10 ms shorter than the threshold for the tokens that preserved the beginning portion
of  the  consonant. 

5.1.1.3  Tokens  that  preserved  the  end  of  the  /s/ 

This  section  discusses  the  results  for  the  25  tokens  that  preserved  the  final  portions 
of  the  /s/  in  the  word  "sue."  The  results  varied  from  the  results  of  the  previous  two  sections, 
although the same general trend was observed in terms of the sequence of consonants that
was  perceived  as  the  duration  was  increased. 

The listeners heard the voiced stop /b/ for durations of 10 and 20 ms, and the voiced
stop  /d/  for  durations  of  30  and  40  ms.  The  reason  for  the  perception  of  a  voiced  stop,  in 
general,  was  the  same  as  was  given  previously,  namely,  that  there  was  an  abrupt  onset  in 
the  /s/  that  resembled  the  sudden  release  of  energy  typically  exhibited  by  a  stop.  Also,  since 
the  consonant  duration  was  less  than  or  equal  to  40  ms,  the  stop  was  perceived  as  voiced. 

In  terms  of  place  of  articulation,  the  explanation  of  the  result  that  two  different  stops 
were perceived at slightly different durations is based upon the explanation for the
perception  of  the  place  of  articulation  that  was  given  in  the  previous  two  sections. 
Examination of Figure 5-1b shows that the final 20 to 30 ms of the unmodified consonant
/s/ contained little energy above 2.5 kHz. Thus, tokens created with this portion of the
signal were perceived as /b/, since /b/ is characterized by a lower-frequency noise burst than
/t/. Looking "backward" from the end of the consonant in Figure 5-1b, the high frequency
content  (above  3  kHz)  of  the  final  portion  of  the  /s/  increased  rapidly  for  durations  greater 
than  about  30  ms.  Thus,  tokens  created  with  this  portion  of  the  /s/  were  perceived  as  /d/, 
since  /d/  is  characterized  by  a  higher-frequency  noise  burst  than  /b/. 

A  large  majority  of  the  listeners  perceived  the  unvoiced  stop  /t/  across  the  relatively 
wide  range  of  durations  from  50  to  130  ms.  This  perception  was  much  stronger  than  was 
observed  for  the  tokens  that  preserved  the  middle  portions  of  the  /s/  in  the  same  word  and 
for  the  same  range  of  durations.  Since  the  energy  transition  during  the  onset  of  the 
time-modified  /s/  was  essentially  the  same  for  both  sets  of  tokens,  the  stronger  perception 
of the /t/ was caused by either (1) the formant frequency transitions between the
time-modified  consonant  and  the  following  vowel,  or  (2)  the  decay  of  the  energy  of  the 
time-modified  /s/  immediately  preceding  the  onset  of  the  vowel.  The  contribution  (or  lack 
thereof)  of  each  of  these  factors  is  discussed  in  the  following  paragraphs. 

It can be seen from Figure 5-1b that the formant frequencies in both the consonant
and the vowel were similar, and were essentially constant for the last 175 ms of the /s/. In
addition, there are no observable formant transitions in the region from sample numbers
5500 to 6000. Therefore, it is unlikely that the tokens that were created by preserving the
end  portions  of  the  /s/  contained  significant  shifts  in  the  formant  trajectories  in  the 
consonant-vowel  transition  region.  This  suggests  that  transitions  in  the  formant  frequency 
tracks  were  not  a  likely  cause  of  the  perception  of  the  unvoiced  stop. 

The only remaining cause for the strong perception of the stop /t/ was the decay
in  the  energy  of  the  time-modified  /s/  preceding  the  onset  of  the  vowel.  This  is  a  likely 
cause,  since  one  of  the  known  acoustic  features  of  a  naturally-produced  unvoiced  stop  is 
a  period  of  (relative)  silence  immediately  following  the  initial  noise  burst  (Dorman  et  al., 
1979; Halle et al., 1957; Klatt, 1975; Stevens, 1980; Stevens and Blumstein, 1978). Thus,
for these tokens, the perception of a stop was increased considerably by the decay in the
energy  of  the  time-modified  consonant  after  the  abrupt  onset. 

The  minimum  duration  of  the  time-modified  consonant  required  for  consistent 
perception  of /s/  was  140  ms.  This  value  was  larger  than  was  observed  for  the  two  previous 
sets  of  tokens.  The  higher  threshold  value  was  due  to  the  relatively  strong  cues  that  caused 
the listener to perceive the unvoiced stop /t/ over a wide range of durations. Evidently, the
perceptual  cues  for  the  stop  /t/  that  were  based  upon  the  energy  contour  of  the  signal  were 
stronger  than  the  perceptual  cues  for  the  voiced  fricative  that  were  based  upon  the  duration 
of  the  high  frequency  noise.  As  a  result,  a  stop  was  heard  over  a  wider  range  of  durations 
than  in  the  previous  sections.  This  "competition"  between  cues  is  important,  and  is 
discussed  further  in  later  sections  of  this  chapter. 

The  results  of  this  section  are  comparable  to  the  results  of  Grimm  (1966).  He 
conducted  similar  experiments  where  the  beginning  portions  of  initial  fricatives  were 
sequentially  removed.  One  of  his  observations  was  that  the  sequence  of  phonemes  that 
resulted  as  the  duration  of  the  /s/  was  increased  from  zero  to  the  unmodified  length  was  /b/ 
to /d/ to /t/ to /s/. This is the same sequence that was observed in this study. However, while
Grimm  found  that  the  duration  of  the  /s/  required  for  50%  identification  was  90  ms,  the 
duration  required  for  50%  identification  in  this  study  was  between  130  and  140  ms. 

5.1.1.4  Comparison  of  the  results  as  a  function  of  position 

The  results  clearly  demonstrated  that  for  durations  between  10  and  130  ms,  the 
position,  or  portion  of  the  phoneme  that  was  preserved  had  a  direct  influence  upon 
perception.  The  factor  that  was  most  affected  was  the  perceived  phoneme  category,  i.e. 
stops  versus  fricatives.  However,  in  several  instances,  the  perceived  place  of  articulation 
was  also  affected. 

Changes  in  the  perception  of  the  phoneme  category  were  attributed  to  the  particular 
envelope,  or  energy  contour,  that  resulted  from  time  modification  of  the  original 
consonant. By modifying the energy contour, it was possible to artificially create cues for
stop  consonants.  If  an  abrupt  transition  in  the  energy  level  at  the  onset  of  the  consonant 
was  created,  a  stop  consonant  was  perceived  by  a  significant  percentage  of  the  listening 
group.  If  this  abrupt  onset  was  accompanied  by  a  gradual  decrease  in  the  envelope  after 
the  abrupt  onset,  this  further  increased  the  perception  of  a  stop. 

In  several  instances,  the  perceived  place  of  articulation  was  affected  by  the  portion 
of  the  phoneme  that  was  preserved.  This  effect  was  observed  primarily  for  the  tokens  that 
had  relatively  short  consonant  durations.  This  was  attributed  to  the  fact  that  the  spectral 
characteristics  of  the  unmodified  consonant  varied  at  the  beginning  and  end  of  the 
consonant. 

5.1.2  The  Word  "Said" 

This  section  discusses  the  results  for  the  63  tokens  that  were  created  from  the  word 
"said."  The  results  show  that  duration  had  a  significant  effect  upon  perception  of  the 
time-modified  /s/.  The  position  of  the  portion  of  the  initial  consonant  that  was  preserved 
had  a  lesser  effect  upon  perception. 

5.1.2.1 Tokens that preserved the beginning of the /s/

This  section  discusses  the  results  for  the  21  tokens  that  preserved  the  beginning 
portions  of  the  /s/  in  the  word  "said."  For  the  range  of  durations  from  zero  to  30  ms,  the 
majority of the listeners chose /θ/ as the initial consonant. In addition, almost 30% of the
listeners heard no initial consonant. This is in contrast to the results for the same range of
durations discussed in Section 5.1.1.1 for the word "sue," where many of the listeners heard
the stop /b/. Comparison of the time-domain waveforms in Figures 5-1a and 5-2a shows
that the overall energy of the original /s/ for the word "sue" was considerably greater than
it was for the word "said." Therefore, a possible explanation for perception of a fricative
instead of a stop is that the energy contour as a function of time for the time-modified /s/
did not adequately resemble the energy release associated with a stop consonant (Behrens
and Blumstein, 1988). Instead, it resembled the relatively low energy contour typically
associated with the weak fricatives /ð/, /θ/, /f/, and /v/.

Figure 5-2. Original word "said." Only the consonant and a portion of the vowel are
shown. The beginning and end of the consonant (based on the automatic
segmentation and labeling results) are shown by arrows above each graph.
a) Time-domain waveform;
b) Spectrogram.

The unvoiced fricative /θ/ was perceived because it was the only weak fricative in
the  list  of  possible  choices.  However,  since  it  is  a  linguadental,  its  place  of  articulation  is 
similar  to  the  place  of  articulation  for  the  alveolar  /s/. 

For  durations  of  40  and  50  ms,  the  listeners  were  almost  equally  divided  between 
/θ/ and /z/, although they usually chose /θ/. There was also a gradual shift in perception
from /θ/ to /z/ that is explained by the increase in the total amount of energy in the
time-modified  initial  consonant  that  resulted  as  the  duration  was  increased.  Therefore,  as 
the  duration  (and  the  total  signal  energy)  increased,  perception  shifted  from  a  "weak" 
fricative to a "strong" fricative. Since the duration of the time-modified consonant was still
relatively  short  compared  to  the  unmodified  duration,  the  fricative  was  perceived  as  voiced 
(Cole  and  Cooper,  1975). 

The  threshold  for  consistent  perception  of  /s/  was  60  ms.  This  value  was  much 
closer  to  the  results  of  Jongman  (1989)  than  it  was  for  the  word  "sue."  However,  it  was 
40  ms  shorter  than  the  results  of  similar  tests  by  Cole  and  Cooper  (1975). 

5.1.2.2  Tokens  that  preserved  the  middle  of  the  /s/ 

The  results  for  the  21  tokens  that  preserved  the  middle  portions  of  the  /s/  were 
almost  identical  to  the  results  for  the  tokens  that  preserved  the  beginning  portions. 

The  only  significant  difference  between  the  two  sets  of  tokens  was  a  slight 
perception  of  the  voiced  stop  /d/  for  durations  of  10  and  20  ms.  This  occurred  due  to  the 
sudden  onset  of  the  time-modified  initial  consonant.  However,  since  the  energy  change  at 
the  onset  of  the  consonant  was  relatively  small,  only  a  few  of  the  listeners  perceived  a  stop. 
Instead, over 50% of the listeners heard /θ/. The explanation for the perception of /θ/ is
the same as was given in the previous section.

The threshold for consistent perception of /s/ was 50 ms. This value was 10 ms
shorter  than  the  threshold  for  the  tokens  that  preserved  the  beginning  portion  of  the 
consonant. 

5.1.2.3 Tokens that preserved the end of the /s/

The  results  for  the  21  tokens  that  preserved  the  end  portions  of  the  /s/  in  the  word 
"said"  were  similar  to  the  results  for  the  tokens  that  preserved  either  the  beginning  or  the 
middle  portions. 

The  primary  difference  was  that  the  unvoiced  stop  /t/  was  perceived  for  durations 
in the range from 30 to 70 ms, and was selected most often by the listeners at durations of
40,  50,  and  60  ms.  The  explanation  for  this  is  the  same  as  that  given  for  the  word  "sue." 
That is, perception of a stop consonant resulted from the combination of (1) the abrupt onset
of the /s/, (2) a decreasing energy contour after the onset, and (3) a relatively short duration
of  little  or  no  signal  (i.e.  silence)  at  the  end  of  the  /s/.  The  consonant  was  perceived  as 
unvoiced  since  the  duration  of  the  time-modified  /s/  was  greater  than  35  ms  (Lisker  and 
Abramson,  1964).  Since  both  are  alveolar,  the  particular  stop  consonant  /t/  was  perceived 
due  to  its  spectral  similarities  to  /s/. 

The  threshold  for  consistent  perception  of /s/  was  70  ms.  Unlike  the  results  for  the 
word  "sue,"  this  threshold  value  was  20  ms  less  than  that  observed  by  Grimm  (1966). 

5.1.2.4 Comparison of the results as a function of position

The  portion  of  the  phoneme  that  was  preserved  had  only  a  small  effect  upon 
perception.  This  effect  was  observed  primarily  for  tokens  with  a  duration  of  70  ms  or  less. 

The  voiced  stop  /d/  was  heard  by  about  20%  of  the  listeners  for  the  token  that 
preserved  the  middle  20  ms  of  the  consonant.  It  was  also  heard  for  the  token  that  preserved 
the  final  30  ms  of  the  consonant.  In  contrast,  less  than  3.7%  of  the  listeners  heard  /d/  for 
these  two  durations  for  the  tokens  that  preserved  the  beginning  of  the  consonant.  Thus, 
there was a slight perception of a voiced stop for the tokens that preserved a short portion
of  either  the  middle  or  the  end  of  the  original  consonant. 

For  durations  in  the  range  from  30  to  70  ms,  between  40%  and  50%  of  the  listeners 
heard  the  unvoiced  stop  /t/  for  the  tokens  that  preserved  the  end  of  the  consonant.  For  the 
tokens  over  this  same  range  of  durations  that  preserved  either  the  beginning  or  the  middle 
of the consonant, either /θ/ or /s/ was heard most often. Thus, a relatively strong cue for
an  unvoiced  stop  was  created  for  the  tokens  that  preserved  the  end  of  the  consonant. 

These  changes  in  the  perception  of  the  phoneme  category  (i.e.  stops  versus 
fricatives)  were  attributed  to  the  particular  envelope,  or  overall  energy  contour,  that 
resulted  from  time  modification  of  the  original  consonant.  This  result  was  observed,  to  a 
greater  degree,  for  the  word  "sue."  The  difference  in  the  strength  of  the  stop  consonant 
perception  between  the  two  words  was  attributed  to  the  difference  in  the  overall  signal  level 
of  the  unmodified  /s/  in  the  words. 

5.1.3  Summary  for  "Sue"  and  "Said" 

Although  the  /s/  was  modified  in  the  same  manner  for  both  "sue"  and  "said,"  the 
sets  of  tokens  created  from  each  of  these  words  were  usually  perceived  differently  for 
consonant  durations  less  than  about  100  ms.  The  dissimilarities  are  grouped  according  to 
the  token  durations,  and  are  summarized  in  the  following  paragraphs. 

For  durations  of  50  ms  or  less,  the  voiced  stops  /b/  and  /d/  were  often  perceived  for 
the tokens created from the word "sue," while the unvoiced fricative /θ/ was perceived for
the  tokens  created  from  the  word  "said."  Throughout  the  previous  discussions,  this  was 
attributed  to  the  difference  in  the  energy  of  the  unmodified  consonant  between  the  two 
words.  Support  for  this  claim  is  as  follows: 

Previous work on the average acoustic power of phonemes has shown that /θ/ is the
weakest of all English phonemes. In addition, the average power of /d/ is 8.7 dB greater
than /θ/, and the average power of /b/ is 7.7 dB greater than /θ/ (Fry, 1979; Levitt, 1978).
Thus, when produced naturally, the consonants /b/ and /d/ that were perceived for the
tokens created from the word "sue" are known to be about 8 dB louder than the consonant
/θ/ that was perceived for the tokens created from the word "said."

Figure  5-3  shows  the  time-domain  waveform  and  the  root-mean-square  (RMS) 
power  for  the  unmodified  words  "sue"  and  "said."  The  RMS  power  was  calculated  once 
per  frame  and  converted  to  dB  by 


\[
\mathrm{RMS}(i) = 20 \log_{10} \sqrt{\frac{1}{N_i} \sum_{n=a}^{b} s^{2}(n)} \tag{5.1}
\]

where i is the current frame index, a is the index of the first sample in the frame, b is the
index of the last sample in the frame, s(n) is the value of the signal at sample n, and N_i is
the length of the current frame (in samples). The frame boundaries were the same as those
used in the pitch-synchronous LPC analysis.
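
A minimal Python sketch of the per-frame calculation is given below. It assumes that Equation 5.1 is 20 log10 of the root-mean-square value of the samples from index a through index b inclusive, as the definitions above suggest. The function name and the handling of an all-zero frame are assumptions, and the pitch-synchronous frame boundaries are taken as given inputs rather than computed.

```python
import math


def frame_rms_db(signal, a, b):
    """Return the RMS level in dB of signal[a..b], with b inclusive.

    signal : sequence of sample values s(n)
    a, b   : indices of the first and last samples in the frame
    """
    n_i = b - a + 1  # frame length in samples
    mean_square = sum(signal[n] ** 2 for n in range(a, b + 1)) / n_i
    if mean_square == 0.0:
        # Assumption: a silent frame is reported as minus infinity dB.
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_square))
```

Because the result is a level in dB relative to a sample value of one, only differences between frames (such as the difference between the two initial consonants) are meaningful, not the absolute numbers.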

Figure  5-3  shows  that  the  RMS  power  of  the  vowel  portion  in  each  of  the  words 
was the same. This was intentional: the amplitude of each of the four unmodified
words  was  scaled  by  a  constant  before  the  test  tokens  were  created  so  that  each  word  was 
presented  to  the  listeners  at  approximately  the  same  volume  level  (see  Section  4.3.1.1). 

However,  the  RMS  power  of  the  initial  consonant  in  the  two  unmodified  words 
differed.  For  the  word  "sue,"  the  RMS  power  was  approximately  59  dB,  while  for  the  word 
"said,"  the  RMS  power  was  approximately  50  dB.  The  difference  was  9  dB.  It  is  important 
to  note  that  this  difference  was  almost  identical  to  the  difference  in  the  average  power 
between  the  two  types  of  sounds  that  were  perceived  (since  /b/  and  /d/  are  about  8  dB 
stronger than /θ/). This supports the explanation that the difference in perception of the
initial consonant (stops versus fricatives) was caused by the difference in the power of the
initial,  unmodified  /s/  in  the  two  words. 

Figure 5-3. Time-domain signal (top) and average root-mean-square (RMS) power in
dB  (bottom).  The  abscissa  denotes  the  sample  number  in  each  graph. 

a)  Original,  unmodified  word  "sue;" 

b)  Original,  unmodified  word  "said." 
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A  second  inconsistency  was  that  the  value  of  the  threshold  for  consistent  perception 
of /s/  was  generally  much  larger  for  the  tokens  created  from  the  word  "sue"  than  it  was  for 
the  tokens  created  from  the  word  "said."  The  thresholds  for  each  word  and  for  each 
"modification  method"  (i.e.  the  portion  of  the  consonant  that  was  preserved)  are  listed  in 
Table  5-1.  The  thresholds  are  also  depicted  graphically  in  Figure  4—9.  The  inconsistency 
in  the  thresholds  was  troubling,  because  depending  upon  the  word  that  was  examined,  the 
results  either  agreed  or  disagreed  with  the  values  reported  in  similar  experiments  by 
Jongman  (1989),  Cole  and  Cooper  (1975),  and  Grimm  (1966).  In  an  attempt  to  explain 
the  difference  in  the  thresholds  between  the  two  words,  several  analysis  methods  were 
investigated.  They  are  discussed  in  the  following  paragraphs. 

Table 5-1. Thresholds for consistent perception of /s/.

Word   Portion of consonant    Threshold for     Threshold divided   Threshold divided
       that was preserved      consistent        by unmodified       by duration of
       (modification method)   perception of     consonant           following
                               /s/ (in ms)       duration (1)        vowel (2)
-----  ---------------------   ---------------   -----------------   -----------------
sue    beginning               110               0.456               0.186
sue    middle                  100               0.414               0.169
sue    end                     140               0.580               0.236
said   beginning                60               0.311               0.196
said   middle                   50               0.259               0.163
said   end                      70               0.363               0.228

Notes: 1. The duration of the unmodified /s/ in "sue" was 241.4 ms, and
the duration of the unmodified /s/ in "said" was 192.7 ms.

2. The duration of the vowel /u/ in "sue" was 592.5 ms, and the
duration of the vowel /e/ in "said" was 306.6 ms.

One theory that was investigated was that the absolute threshold was not a reliable
method of comparing the amount of the consonant that was required for consistent
perception. Instead, it was hypothesized that the boundary for consistent perception of /s/,
in terms of the percentage of the unmodified consonant duration, was a more reliable
indicator. This calculation effectively normalized the threshold by the original consonant
duration. For example, the first row of Table 5-1 shows that the threshold for consistent
perception of /s/ for the tokens that preserved the beginning of the /s/ in the word "sue" was
110 ms. The unmodified duration of the original /s/ was 241.4 ms. Therefore, the
normalized threshold was equal to 110 ÷ 241.4 = 0.456. The fourth column of Table 5-1
lists these normalized thresholds.

A comparison of the absolute and normalized thresholds follows. For the tokens that preserved the
beginning  portions  of  the  /s/,  the  threshold  for  the  word  "sue"  was  110  ms,  and  the  threshold 
for  the  word  "said"  was  60  ms,  a  45.5%  difference.  The  normalized  threshold  for  the  word 
"sue"  was  0.456,  and  the  normalized  threshold  for  the  word  "said"  was  0.311,  a  31.8% 
difference.  For  the  tokens  that  preserved  the  middle  portions  of  the  /s/,  the  threshold  for 
the  word  "sue"  was  100  ms,  and  the  threshold  for  the  word  "said"  was  50  ms,  a  50.0% 
difference.  The  normalized  threshold  for  the  word  "sue"  was  0.414,  and  the  normalized 
threshold  for  the  word  "said"  was  0.259,  a  37.4%  difference.  For  the  tokens  that  preserved 
the  end  portions  of  the  /s/,  the  threshold  for  the  word  "sue"  was  140  ms,  and  the  threshold 
for  the  word  "said"  was  70  ms,  a  50.0%  difference.  The  normalized  threshold  for  the  word 
"sue"  was  0.580,  and  the  normalized  threshold  for  the  word  "said"  was  0.363,  a  37.4% 
difference.  Since  the  difference  in  the  normalized  thresholds  was  still  relatively  large,  this 
technique  was  not  an  adequate  method  for  comparing  the  results  for  the  two  words. 

However, an interesting result occurred if the threshold was normalized by dividing
it by the unmodified duration of the following vowel. For example, the threshold for
consistent perception of /s/ for the tokens that preserved the beginning of the /s/ in the word
"sue" was 110 ms. The duration of the vowel /u/ in "sue" was 592.5 ms. Therefore, the
"vowel-normalized" threshold was equal to 110 ÷ 592.5 = 0.186. These results are shown
in the last column of Table 5-1. The results show that for a given modification method,
the vowel-normalized thresholds are almost identical for the two words. For example,
consider the tokens that were created by preserving the end portion of the original
consonant. The threshold for consistent perception of /s/ was 140 ms for "sue," and 70 ms
for "said," a 50.0% difference. However, the vowel-normalized threshold was 0.236 for
"sue," and 0.228 for "said," a difference of only 3.4%. Likewise, the difference between
the vowel-normalized thresholds for the two words was 3.6% for the tokens that preserved
the middle of the /s/, and 5.4% for the tokens that preserved the beginning of the /s/.
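
The normalizations and percentage differences quoted above can be reproduced from the
values in Table 5-1 with a short sketch such as the following. The function and variable
names are hypothetical. The sketch assumes that each percentage difference is taken
relative to the value for "sue" and is computed from ratios rounded to three decimal places,
which is the convention that matches the figures reported in the text.

```python
# Thresholds for consistent perception of /s/ (ms), from Table 5-1.
THRESHOLD_MS = {
    "sue": {"beginning": 110, "middle": 100, "end": 140},
    "said": {"beginning": 60, "middle": 50, "end": 70},
}
# Unmodified consonant and following-vowel durations (ms), from the table notes.
CONSONANT_MS = {"sue": 241.4, "said": 192.7}
VOWEL_MS = {"sue": 592.5, "said": 306.6}


def percent_difference(sue_value, said_value):
    """Difference between the two words, as a percentage of the "sue" value."""
    return abs(sue_value - said_value) / sue_value * 100


def compare(method):
    """Return the percentage differences for one modification method."""
    sue = THRESHOLD_MS["sue"][method]
    said = THRESHOLD_MS["said"][method]
    return {
        "absolute": percent_difference(sue, said),
        "consonant_normalized": percent_difference(
            round(sue / CONSONANT_MS["sue"], 3),
            round(said / CONSONANT_MS["said"], 3),
        ),
        "vowel_normalized": percent_difference(
            round(sue / VOWEL_MS["sue"], 3),
            round(said / VOWEL_MS["said"], 3),
        ),
    }


if __name__ == "__main__":
    for method in ("beginning", "middle", "end"):
        result = compare(method)
        print(method, {name: round(value, 1) for name, value in result.items()})
```

Only the vowel normalization brings the two words within a few percent of each other for
every modification method.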

This  finding  was  important,  because  it  indicated  that  perception  of  the  initial 
consonant  depended  upon  the  duration  of  the  following  vowel.  This  disagreed  with  the 
results  of  Cole  and  Cooper  (1975)  who  showed  that  the  duration  of  the  following  vowel 
had  little  effect  upon  the  perception  of  the  voicing  of  the  initial  consonant. 

The finding that the vowel-normalized thresholds were almost identical for the two
words supports the theory that human perception of speech is not accomplished on a
"phoneme-by-phoneme" basis and is instead dependent upon the sequence of phonemes
that are present in the speech signal. This theory has also been investigated in other studies.
Denes (1955) showed that the relative durations of the vowel and the final consonant in
CVC words affected the perceived voicing of the final consonant. Similarly, Raphael
(1971) reported that the preceding vowel duration affected perception of the voicing of the
following consonant in CVC words, independent of the consonant duration. Although this
was slightly different from the results of Denes, both showed that the acoustic features
associated with one phoneme affected the perception of adjacent phonemes. In two other
studies, Miller (1981) and Miller and Liberman (1979) showed that the specific duration
that marked the threshold between perception of /b/ and /w/ in word-initial position in
single-syllable words increased as the syllable duration was increased.

However, the results of the present study still did not account for the fact that the
thresholds for consistent perception of /s/ for the word "sue" were similar in value to
those reported by Cole and Cooper (1975) and Grimm (1966), while the thresholds
reported  for  the  word  "said"  were  much  smaller,  and  were  closer  to  the  results  reported  by 
Jongman  (1989).  Unfortunately,  none  of  the  prior  studies  reported  the  duration  of  the 
vowels  used  in  their  tokens,  so  application  of  the  vowel-normalization  theory  of  this  study 
to  their  results  was  not  possible. 

5.2  Perception  of  the  Time-Modified  /z/ 

This  section  discusses  the  results  for  the  tokens  that  were  created  by  modifying  the 
duration  and  position  of  the  /z/  in  the  CV  word  "zoo"  and  the  CVC  word  "zed."  The  results 
for  the  two  words  are  discussed  separately,  and  then  compared. 

5.2.1  The  Word  "Zoo" 

This  section  discusses  the  results  for  the  75  tokens  that  were  created  from  the  word 
"zoo."  In  general,  the  results  showed  that  both  duration  and  position  affected  perception 
of  the  time-modified  /z/.  It  was  observed  that  position  affected  perception  due  to  the 
non-stationary  characteristics  of  the  frication  noise  in  the  original  /z/. 

5.2.1.1 Tokens that preserved the beginning of the /z/

Tests of the 25 tokens that were created by preserving the beginning portions of the
/z/ in the word "zoo" contained some surprising results, especially for intermediate
durations.

For durations in the range from zero to 30 ms, the listeners perceived no initial
consonant (the word "ooh") most often. Stops were not heard since there was no sudden
onset of noise. This is because the energy in the unmodified /z/ built slowly, as shown in
Figure 5-4. Also, since the formant frequencies were similar for both the consonant and
the vowel, there were no formant transitions to cue the presence of a stop. This is important
to note because it has been reported that formant transitions can signal a stop, even without
an accompanying noise burst (Liberman, 1957; Liberman et al., 1956). Due to voicing,
the energy was strongest at relatively low frequencies (below about 500 Hz), as seen in
Figure 5-4b. This prevented perception of a weak unvoiced fricative such as /θ/. Although
weak voiced fricatives such as /v/ or /ð/ may have been perceived, they were not contained
in the list of possible answers.

Figure 5-4. Original word "zoo." Only the consonant and a portion of the vowel are
shown. The beginning and end of the consonant (based on the automatic
segmentation and labeling results) are shown by arrows above each graph.

a) Time-domain waveform;

b) Spectrogram.

An interesting and unanticipated result was the perception of the liquid /l/ by a
majority of the listeners over a relatively wide range from 40 to 160 ms. This perception
was quite strong, as shown by the fact that 60% to 70% of the listeners perceived this
phoneme. Explanation of this perception requires an understanding of the similarities in
the speech production mechanisms for both /l/ and /z/. These production mechanisms are
explained in the following paragraphs.

The liquid /l/ is one of two semivowels that exhibits a relatively high sonorance (the
other is /r/). It is produced by moving the tip of the tongue towards the alveolar ridge while
producing voicing at the same time. Although the tip of the tongue may contact the alveolar
ridge, the airflow is not completely constricted due to openings in the vocal tract at either
side of the tongue. Therefore, the liquid /l/ is termed a "lateral" (Olive et al., 1993).

The liquid /l/ is highly variable in terms of its articulation. Therefore, it has
numerous allophonic variations (Edwards, 1992). When spoken in word-initial position
by a male speaker, it has a first formant frequency of approximately 340 Hz, a second
formant frequency of about 1180 Hz, and a third formant frequency of about 2520 Hz
(Dalston, 1975). The value of the second formant frequency is slightly less than typically
exhibited by other alveolar phonemes due to the openings in the tract on either side of the
tongue. Note, however, that these formant values are only targets; the actual values differ
across the allophonic variations. When the /l/ is spoken in word-initial position, the third
formant often has a value similar to that of the following vowel. This is coincidental, and
cannot be attributed to coarticulation. As a result, in transitions to a following vowel, the
third formant frequency does not change appreciably. The lack of third formant movement
during the transition to a neighboring vowel serves to differentiate /l/ from /r/, since /r/ has
a third formant frequency of less than 2 kHz (O'Connor et al., 1957). In addition, although
it is not necessary for production of the /l/, the lips are often rounded for production of the
acoustically-similar /r/ (Borden and Harris, 1984).

The /z/ is an alveolar voiced fricative. It is produced by creating a constriction
between the tongue and the alveolar ridge. A second constriction is also formed between
the upper and lower incisors. The high frequency component of the sound is created by
turbulent air that is forced through the constrictions. The low frequency component is
caused by the vibration of the vocal cords (Borden and Harris, 1984). Due to coarticulation
effects, the formant frequencies of /z/ are heavily influenced by the following vowel. When
/z/ is followed by the vowel /u/, the first formant is approximately 300 Hz, and the second
formant is approximately 1600 Hz for a male speaker (Olive et al., 1993).

The difference between production of the two sounds is primarily in the presence
of frication noise in the /z/. There is also a slight discrepancy in the value of the second
formant frequency. Due to the opening at the sides of the tongue, the second formant
typically associated with /l/ is lower than normally observed for the alveolar /z/. However,
due to the number of ways that /l/ can be pronounced, perception of /l/ is not strictly
dependent upon the value of the second formant frequency.

Therefore, the explanation of why /l/ was perceived when the beginning portions
of the /z/ were preserved is as follows: As discussed, the production mechanisms for both
the /l/ and the /z/ are similar. Both are voiced, and both require that the tongue contact the
alveolar ridge. In addition, lip rounding often accompanies production of the liquid /r/,
which can be similar to the /l/. In the original word "zoo," the vowel /u/ is lip rounded.
Since the voiced fricative /z/ is highly susceptible to coarticulation effects from the
following vowel, lip rounding is also exhibited by the initial /z/ in the word "zoo." Thus,
there are many similarities between the two phonemes.

Returning to the particular /z/ that was modified, examination of Figure 5-4 shows
that the high frequency energy in the unmodified /z/ built slowly, and reached its peak
magnitude near the end of the consonant. However, the low frequency energy in the /z/
built quickly, and remained relatively constant in magnitude throughout the consonant. In
addition, there was considerable energy in the region of the second and third formants that
was essentially constant in magnitude and in frequency throughout the /z/.

The voiced fricative /z/ is usually accompanied by a significant amount of high
frequency turbulent noise. However, for this specific occurrence, the high frequency noise
occurred predominantly at the end of the unmodified consonant, and was missing, for the
most part, from the beginning of the consonant. Thus, the initial portion of the unmodified
/z/ had a set of acoustic features that more closely resembled the features associated with
the liquid /l/ than with the fricative /z/. In summary, while this result is not intuitively
obvious, the acoustic theory of production is able to adequately explain the perception of
the liquid /l/.
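
The observation that the frication noise was concentrated at the end of the consonant was
made by inspecting the spectrogram. One way such an observation could be quantified is
sketched below: the fraction of each frame's spectral energy that lies at or above a cutoff
frequency is tracked over time. This is an illustration only and was not part of the analysis
in this study. The function names, the 2500 Hz cutoff, the fixed frame length, and the
Hann window are all assumptions chosen for the example.

```python
import cmath
import math


def high_band_fraction(frame, sample_rate, cutoff_hz):
    """Fraction of a frame's spectral energy at or above cutoff_hz."""
    n = len(frame)
    # Hann window to limit spectral leakage from the strong voicing component.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)))
        for k, x in enumerate(frame)
    ]
    total = 0.0
    high = 0.0
    # Direct DFT over the non-negative frequency bins (slow, but dependency-free).
    for m in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * m * k / n)
            for k, x in enumerate(windowed)
        )
        power = abs(coeff) ** 2
        total += power
        if m * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0


def track_high_band(signal, sample_rate, frame_len=128, cutoff_hz=2500.0):
    """High-band energy fraction for consecutive, non-overlapping frames."""
    return [
        high_band_fraction(signal[start:start + frame_len], sample_rate, cutoff_hz)
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]
```

For a consonant like the /z/ in "zoo," a measure of this kind would be expected to stay near
zero over the initial, voicing-dominated portion and to rise only where the high frequency
noise appears.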

The threshold for consistent perception of /z/ was observed to be 170 ms. This
value is also in sharp contrast to the value of 30 ms reported by Jongman (1989). The large
difference was attributed to the lack of high frequency energy in the beginning of the
unmodified /z/ that resulted in a strong cue for perception of the liquid /l/ over a wide range
of durations.

5.2.1.2 Tokens that preserved the middle of the /z/

Although there were differences, results for the 25 tokens that were created by
preserving the middle portions of the /z/ in the word "zoo" followed approximately the
same pattern as the results for the tokens that preserved the beginning portions.

The listeners perceived no initial consonant (the word "ooh") most often only for
a duration of zero ms. Unlike the previous results, this perception decreased sharply for
durations of 10 ms and greater. Above 10 ms, a small percentage of the listeners heard
either the voiced stop /d/ or the unvoiced fricative /θ/. However, the largest group of the
listeners (approximately 50%) heard /l/. Examination of the spectrogram in Figure 5-4b
shows that there was insufficient high frequency energy present in the middle portion of
the unmodified /z/ for perception of a fricative. Therefore, for the reasons that were
described in the previous section, /l/ was perceived instead of /z/.

It is interesting to note that as the duration of the consonant was increased from 10
to 120 ms, increasing amounts of the high frequency energy present in the final portion of
the unmodified /z/ were included in the time-modified tokens. As a result, perception of
/z/ rose steadily.

The threshold for consistent perception of /z/ was 130 ms. As in the preceding
section, this relatively large value is attributed to the strong cues that signaled the presence
of the liquid /l/ over a wide range of durations.

5.2.1.3 Tokens that preserved the end of the /z/

Results for the 25 tokens that were created by preserving the end portions of the /z/
in the word "zoo" were dissimilar to the results for the tokens that preserved either the
beginning or the middle of the /z/. This was due to the fact that the frication was most
prevalent at the end of the original /z/.

Over the range from zero to 40 ms, the initial consonant was perceived most often
as missing. At first glance, this is puzzling, since there should have been sufficient high
frequency energy at the end of the /z/ to have caused perception of either a stop or a
fricative. Examination of the spectrogram in Figure 5-4b offers an explanation: The
boundary between the /z/ and the /u/ in the unmodified word "zoo" was calculated by the
automatic analysis algorithms to occur at sample number 5933. This is 20 to 30 ms after
the end of the high frequency component of the /z/. Although it is debatable, the boundary
between the /z/ and /u/ actually occurs two or three pitch periods earlier than marked. Thus,
no initial consonant was perceived for the tokens with a duration in the range from zero
to 40 ms because the portions of the original consonant that were preserved were actually
part of the vowel /u/, and contained no frication noise.

Perception of the token with a duration of 50 ms was split almost equally between
/d/, /l/, /θ/, and /z/, although /d/ was perceived most often. This result shows the existence
of multiple, conflicting cues for the various consonants. Perception of the token with a
duration of 60 ms was split between /z/ and /l/, with /l/ being selected most often. While
multiple cues still existed, the cues for /d/ diminished since the duration was longer than
40 ms (the boundary between perception of a voiced and an unvoiced stop), and the cues
for /θ/ diminished since the total energy of the time-modified consonant increased above
the level normally exhibited by a weak, unvoiced fricative.

As expected, the tokens that had a duration in the range from 70 to 240 ms were
not widely perceived as /l/. Although 20% to 30% of the listeners did perceive the liquid,
they  were  in  the  minority.  The  majority  of  the  listeners  chose  the  initial  consonant  as  /z/, 
due  to  the  presence  of  high  frequency  frication  noise. 

The threshold for consistent perception of /z/ was observed to be 70 ms. This value
was much less than it was for the tokens that preserved either the beginning or the middle
of the consonant. Since the cues for /l/ did not dominate perception for these tokens, this
value is closer to the results of previous work than the value for the tokens that preserved
either the beginning or the middle of the /z/ (Cole and Cooper, 1975; Grimm, 1966).

5.2.1.4 Comparison of the results as a function of position

The results demonstrated that for durations between 10 and 170 ms, the portion of
the phoneme that was preserved had a clear influence upon perception. The liquid /l/ was
heard for the tokens that preserved either the beginning or the middle of the original /z/,
and the voiced fricative /z/ was heard for the tokens that preserved the end of the /z/.

However, due to the inconsistent spectral characteristics of the frication component
of the original /z/, it was impossible to ascertain if the effect of position upon perception
was independent of frequency. This is because perception of a voiced fricative was shown
to be dependent, in part, upon the frequency characteristics of the time-modified
consonant. In particular, accurate perception of /z/ depended upon the presence of an
adequate amount of high frequency noise.

5.2.2  The  Word  "Zed" 

This  section  discusses  the  results  for  the  57  tokens  that  were  created  from  the  word 
"zed."  The  results  show  that  duration  had  a  significant  effect  upon  perception  of  the 
time-modified /z/. The position of the portion of the initial consonant that was preserved
only  had  a  significant  effect  for  consonant  durations  of  50  ms  or  less. 

5.2.2.1 Tokens that preserved the beginning of the /z/

This section discusses the results for the 19 tokens that preserved the beginning
portions of the /z/ in the word "zed." For durations in the range from zero to 50 ms, /θ/ was
usually heard as the initial consonant. Examination of the spectrogram in Figure 5-5b
shows that considerable low frequency energy (due to voicing) was present during the
beginning of the unmodified /z/. Therefore, it is unusual that an unvoiced fricative was
perceived, since the presence of low frequency energy is a strong cue for voicing.

One explanation for this is that some of the listeners may have chosen the answer
"thed" when either the unvoiced /θ/ or the voiced /ð/ was perceived. While the listeners
were told in the test instructions that the "th" in the answers "thoo" and "thed" were both
unvoiced, there was no word choice with the initial consonant /ð/ in the list of possible
answers. Therefore, it is possible that some of the listeners forgot the instructions and chose
the word with the "th" in the initial position, regardless of the perceived voicing. This
would help explain why the results at short durations show that an unvoiced sound was
perceived when a strong voicing component was present. Note that if this was the case,
it demonstrates a defect in the set of answers presented to the listeners.

Figure 5-5. Original word "zed." Only the consonant and a portion of the vowel are
shown. The beginning and end of the consonant (based on the automatic
segmentation and labeling results) are shown by arrows above each graph.

a) Time-domain waveform;

b) Spectrogram.

The threshold for consistent perception of /z/ was observed to be 60 ms. This value
is twice the value of 30 ms reported by Jongman (1989).

5.2.2.2 Tokens that preserved the middle of the /z/

When the /z/ was completely removed, the listeners perceived /θ/ (or possibly /ð/,
see the above discussion) most often. For the range from 10 to 30 ms, a large majority chose
the voiced /d/ as the initial consonant. The perception was quite strong, and was limited
to this small range of durations. Perception of the stop was caused, in part, by the abrupt
onset of the time-modified /z/. The stop was perceived as voiced due to the strong
fundamental frequency component. The alveolar stop /d/ was perceived since it was closest
in frequency content to the original alveolar voiced fricative /z/.

A sharp shift in perception between /d/ and /z/ occurred at a duration of 40 ms. At
a duration of 30 ms, /d/ was perceived by 79.6% of the listeners. Conversely, for durations
of 40 ms and greater, /z/ was perceived by 92.6% (or greater) of the listeners. Several
factors led to this sudden shift in perception. First, the value of 40 ms was within 5 ms of
the duration that typically marks the threshold between perception of voiced and unvoiced
stops (Lisker and Abramson, 1964). The results of this study agree with this value, since
/d/ was perceived by the majority of listeners at 10, 20, and 30 ms. However, unvoiced
stops were not perceived for longer durations, as might be expected. This was due to the
presence of a strong low frequency voicing component that precluded the perception of an
unvoiced sound. Therefore, due to the combination of voicing and high frequency noise,
a voiced fricative was heard for consonant durations of 40 ms or greater.

The threshold for consistent perception of /z/ was observed to be 40 ms. This value
is 20 ms less than the value for the tokens that preserved the beginning of the consonant.

5.2.2.3 Tokens that preserved the end of the /z/

This section discusses the results for the 19 tokens that were created by preserving
the end portion of the /z/ in the word "zed." In general, the results were similar to the results
for the tokens that were created by preserving the beginning portions of the consonant.

For durations in the range from zero to 30 ms, the listeners were divided between
/θ/ and /z/. As with all of the tokens that were created from the word "zed," it is hypothesized
that some of the listeners did not distinguish between the unvoiced /θ/ and the voiced /ð/.
This caused the misleading result that the listeners perceived an unvoiced initial consonant
for tokens that had a distinct voiced component.

The threshold for consistent perception of /z/ was observed to be 40 ms. Although
Grimm (1966) did not report individual numerical values for his voiced fricative
experiments, he did note that "the voiceless fricatives relative to the voiced ones required
twice as much of the noise portion to be 50% intelligible." Therefore, if this statement is
interpreted literally for the /s, z/ pair, one-half of his reported threshold for /s/ is
approximately 40 ms. This "interpolated" value is identical to the threshold observed in
this study.

5.2.2.4 Comparison of the results as a function of position

An interesting result was observed for the word "zed." For durations of 10,
20, and 30 ms, /d/ was perceived by 79.6%, 63.0%, and 79.6% of the listeners, respectively,
for the tokens that were created from the middle portion of the original consonant. For the
same three durations, the perception of /d/ for the tokens that were created from either the
beginning or the end of the original consonant was less than 11.1%, the level of chance.

Since a cue for the perception of stops is the sudden onset of the time-modified
consonant, the tokens that were created from the end portions of the /z/ should also have
caused perception of /d/ for the durations of 10, 20, and 30 ms. This is because these tokens
had approximately the same abrupt onset of the time-modified consonant as did the tokens
created from the middle of the consonant. However, /d/ was not perceived for the tokens
that preserved the end of the original consonant. Examination of the time-domain
waveform in Figure 5-5a and the spectrogram in Figure 5-5b provides clues for this
unexpected result. The figure shows that the frequency distribution of the original /z/ was
not constant over time. The beginning portion contained mostly low frequency voicing.
The middle portion contained less voicing than the beginning, as well as a relatively large
amount of high frequency noise. The end portion contained voicing, a significant amount
of energy at the second and third formant frequencies, and less high frequency noise than
the middle portion. Therefore, the /d/ was not perceived for the tokens that preserved the
end of the consonant due to the ratio of the energy at the formant frequencies to the energy
of the turbulent noise. The energy at the formant frequencies presented a strong cue for
a sustained sound. This cue was able to override the cue for a voiced stop that was created
by the sudden onset of the time-modified consonant and accompanying voicing. This
example showed how contrasting cues sometimes existed simultaneously. When this
occurred, perception was dependent upon the relative strength of each of the cues that were
present.

5.2.3  Summary  for  "Zoo"  and  "Zed" 

Although the /z/ was modified in the same manner for both "zoo" and "zed," the
sets of tokens created from each of the words were usually perceived differently for almost
all of the consonant durations that were investigated. The dissimilarities are grouped
according to the token durations, and are summarized in the following paragraphs.

For durations of 50 ms or less, either no initial consonant or the liquid /l/ was
perceived for the tokens created from the word "zoo," while either the unvoiced fricative
/θ/ or the stop /d/ was perceived for the tokens created from the word "zed." The perceptual
differences were attributed to the variations in the spectral content of the original /z/ in both
words.

In  the  word  "zoo,"  the  original  /z/  exhibited  strong  energy  at  the  formant 
frequencies  that  did  not  vary  over  time.  However,  the  frication  was  observed  to  occur 
mostly at the end of the consonant. In addition, the boundary between the /z/ and the
following vowel /u/ was (incorrectly) marked approximately 30 ms after the actual end of
the frication. These factors contributed to the perception that the initial consonant was (1)
missing for tokens that were created from either the beginning or end of the consonant, and
(2) the liquid /l/ for the tokens that were created from the middle of the consonant.

In the word "zed," the beginning of the /z/ consisted primarily of a voice bar.
The middle of the /z/ mostly contained frication and voicing, and the end of the /z/
contained voicing, a substantial amount of formant energy, and relatively less frication.
For consonant durations of 50 ms or less, these spectral characteristics created the
perception that the initial consonant was the unvoiced fricative /θ/ for the tokens that
preserved the beginning of the original /z/. However, it was noted that there was the
possibility that some of the listeners actually heard the consonant /ð/. The tokens that
preserved the middle of the consonant were perceived as /d/, due to the abrupt onset and
the presence of voicing. The tokens that preserved the end of the /z/ were perceived as
either /θ/ (or possibly /ð/) or /z/. Despite the sudden onset, a stop was not perceived due
to  the  strong  cues  for  a  sustained  sound  created  by  the  large  amount  of  formant  energy. 

For durations in the range from approximately 50 to 150 ms, strong cues for the
presence of the liquid /l/ were created by preserving either the beginning or the middle of
the consonant from the word "zoo." Conversely, these cues were much weaker for the
tokens that preserved the end of the consonant. Again, this was attributed to the differences
in the spectrum between the various portions of the original /z/. For this same range of
durations for the word "zed," the cues for the liquid /l/ were not present. Instead, /z/ was
perceived in almost all cases.

Although  it  was  not  mentioned  previously,  there  was  also  a  secondary  factor  that 
may  have  influenced  perception  of  the  tokens  with  consonant  durations  from  50  to  150  ms. 
The  factor  was  related  to  the  amplitude  of  the  original  /z/.  Note,  however,  that  this  factor 
was not considered to be the principal cause for perception of the liquid /l/. Figure 5-6
shows  the  time-domain  waveform  and  the  RMS  power  for  the  unmodified  words  "zoo"  and 
"zed"  (the  figure  is  similar  to  Figure  5-3  for  the  "sue/said"  word  pair).  The  RMS  power 
was  calculated  once  for  each  frame  from  Equation  5.1.  The  RMS  power  of  the  vowel 
portion  of  each  of  the  words  was  the  same  (see  Section  4.3.1.1).  However,  the  power  of 
the initial /z/ differed: For "zoo," it was about -6 dB relative to the vowel, while for "zed,"
it  varied  between  -15  and  -8  dB  relative  to  the  vowel.  Despite  the  variations,  the  power 
of  the  original  /z/  for  the  word  "zed"  was  generally  less  than  it  was  for  the  word  "zoo." 
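The level comparison above can be illustrated with a minimal sketch of a per-frame RMS level expressed in dB relative to a reference level. This is not Equation 5.1, which is not reproduced in this section. The sketch assumes fixed-length frames (the system itself frames the signal pitch-synchronously), and the function name `frame_rms_db` and the synthetic signal are hypothetical. Taking 20 log10 of an RMS amplitude ratio is equivalent to taking 10 log10 of the corresponding mean-square power ratio.

```python
import numpy as np


def frame_rms_db(signal, frame_len, ref_rms):
    """RMS level of each non-overlapping frame, in dB relative to ref_rms.

    Assumption: fixed-length frames; trailing samples that do not fill a
    whole frame are ignored.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    eps = 1e-12  # avoids log10(0) for silent frames
    return 20.0 * np.log10(np.maximum(rms, eps) / ref_rms)


# Synthetic example: a "consonant" with half the amplitude of the "vowel".
t = np.arange(800)
vowel = np.sin(2 * np.pi * t / 80)          # ten full cycles
consonant = 0.5 * vowel
vowel_rms = np.sqrt(np.mean(vowel ** 2))
levels = frame_rms_db(consonant, frame_len=80, ref_rms=vowel_rms)
print(np.round(levels, 2))
```

Halving the amplitude lowers the level by about 6 dB, which is the order of the difference reported for the /z/ in "zoo" relative to its vowel.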

Previous work on the average acoustic power of English phonemes has shown that
relative to the vowel /u/, (1) the average level of /l/ is -2.4 dB, (2) the average level of /z/
is -11.7 dB, and (3) the average level of /θ/ is -11.4 dB (Fry, 1979; Levitt, 1978). In effect,
when produced naturally, the average levels of /z/ and /θ/ are both about 9 dB less than the
average level of /l/. This correlates well with the fact that the level of the original consonant
in the word "zed" (that led to perception of both /z/ and /θ/) was between 2 and 9 dB less than
the level of the original consonant in the word "zoo" (that led to perception of /l/). As a
reminder,  these  effects  are  considered  to  be  secondary  cues  that  may  have  helped  to 
influence  perception.  They  are  not  considered  further  due  to  the  fact  that  the  reported 
acoustic  power  of  each  of  the  phonemes  is  an  average.  As  a  result,  the  exact  level  of  the 
signal  cannot  be  considered  to  be  a  necessary  cue  for  perception,  but  rather  only  a 
supporting factor.

Figure 5-6. Time-domain signal (top) and average root-mean-square (RMS) power in
dB (bottom). The abscissa denotes the sample number in each graph.

a) Original, unmodified word "zoo;"

b) Original, unmodified word "zed."

A discrepancy also existed when the thresholds for consistent perception of /z/ were
compared  for  the  two  words.  This  was  similar  to  the  discrepancy  that  existed  for  the 
"sue/said"  word  pair,  namely,  depending  upon  the  word  that  was  examined,  the  results 
either  agreed  or  disagreed  with  the  values  reported  in  similar  experiments  by  Jongman 
(1989),  Cole  and  Cooper  (1975),  and  Grimm  (1966). 

Explanation  of  the  wide  range  of  thresholds  was  complicated  by  the  fact  that 
perception of the liquid /l/ dominated the results for many of the tokens created from the
word "zoo." Perception of the liquid was attributed to the specific articulation that was
used by the male speaker to produce the word "zoo." Since /z/ is normally produced with
a strong, steady frication component (Edwards, 1992), it is believed that the perception of
/l/ in this study could not be interpreted as a result that would routinely be observed in
further work. Therefore, in the following discussion, the results for the tokens that
produced the large number of /l/ responses were not included. Thus, the following
comparison  between  the  two  words  focused  only  on  the  thresholds  for  the  tokens  that 
preserved  the  end  portions  of  the  original  consonant. 

The observed thresholds for consistent perception of /z/ are listed in the third
column of Table 5-2. The results are given for both words and for each of the three
modification methods (i.e., the portion of the consonant that was preserved). This is also
depicted graphically in Figure 4-10. In an attempt to explain the differences in the
thresholds  between  the  two  words,  the  same  analysis  methods  that  were  followed  for  the 
"sue/said"  word  pair  were  investigated.  They  are  discussed  in  the  following  paragraphs. 

The  fourth  column  in  Table  5-2  shows  the  normalized  thresholds  that  were 
calculated by dividing the threshold by the duration of the unmodified /z/. For the tokens
that preserved the end portions of the /z/, the threshold for the word "zoo" was 70 ms, while
the threshold for the word "zed" was 40 ms, a 43% difference. For these same tokens, the
normalized threshold for the word "zoo" was 0.295, while the normalized threshold for the
word "zed" was 0.230, a 22% difference. These results show that the normalized threshold
was not a reliable technique for comparing the thresholds for the two words.

Table 5-2. Thresholds for consistent perception of /z/.

Word      Portion of consonant     Threshold for        Threshold divided   Threshold divided
          that was preserved       consistent           by unmodified       by duration of
          (modification method)    perception of /z/    consonant           following
                                   (in ms)              duration (1)        vowel (2)

zoo (3)   beginning                170                  0.718               0.336
zoo (3)   middle                   130                  0.549               0.257
zoo       end                       70                  0.295               0.138
zed       beginning                 60                  0.345               0.192
zed       middle                    40                  0.230               0.128
zed       end                       40                  0.230               0.128

Notes: 1. The duration of the unmodified /z/ in "zoo" was 236.9 ms, and
          the duration of the unmodified /z/ in "zed" was 174.1 ms.

       2. The duration of the vowel /u/ in "zoo" was 506.5 ms, and the
          duration of the vowel /e/ in "zed" was 311.8 ms.

       3. The thresholds in this row were largely influenced by cues for
          the liquid /l/. See text for explanation.


The fifth column of Table 5-2 lists the vowel-normalized thresholds. For the
tokens that preserved the end portions of the /z/, the vowel-normalized threshold for the
word "zoo" was 0.138, while the vowel-normalized threshold for the word "zed" was
0.128, a 7% difference. This calculation shows that the duration of the following vowel
influenced the threshold for consistent perception of the /z/, at least for the tokens that were
created by preserving the end of the original /z/. This result was similar to that observed
for the "sue/said" word pair, namely, that the duration of the following vowel affected
perception of the initial time-modified consonant.
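The normalized thresholds and percentage differences quoted for the end-preserving tokens can be re-derived from the values in Table 5-2 and its notes. The short sketch below does so. The helper names are hypothetical, and the convention of expressing each difference as a fraction of the larger value is an assumption, chosen because it reproduces the reported 43%, 22%, and 7% figures.

```python
# Thresholds (ms) for the end-preserving tokens, from Table 5-2.
threshold_ms = {"zoo": 70.0, "zed": 40.0}
# Unmodified consonant and vowel durations (ms), from the table notes.
consonant_ms = {"zoo": 236.9, "zed": 174.1}
vowel_ms = {"zoo": 506.5, "zed": 311.8}


def normalized(thresholds, durations):
    """Divide each word's threshold by the matching duration."""
    return {word: thresholds[word] / durations[word] for word in thresholds}


def relative_difference(a, b):
    """Difference between a and b as a fraction of the larger value
    (assumed convention; see the lead-in)."""
    larger, smaller = max(a, b), min(a, b)
    return (larger - smaller) / larger


by_consonant = normalized(threshold_ms, consonant_ms)
by_vowel = normalized(threshold_ms, vowel_ms)

for label, values in [("raw threshold", threshold_ms),
                      ("/ consonant", by_consonant),
                      ("/ vowel", by_vowel)]:
    diff = relative_difference(values["zoo"], values["zed"])
    print(f"{label:14s} zoo={values['zoo']:.3f} zed={values['zed']:.3f} "
          f"difference={diff:.0%}")
```

Normalizing by the vowel duration brings the two words much closer together than normalizing by the consonant duration, which is the basis for the conclusion drawn in the text.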

5.3  General  Observations 

The  results  discussed  here  supported  earlier  work  that  showed  that  duration 
affected  perception  of  the  initial  consonant  in  single-syllable  CV  and  CVC  words.  This 
study  demonstrated  that  in  certain  cases,  shortening  the  duration  of  an  unvoiced  fricative 
could  create  cues  for  a  voiced  fricative.  The  results  also  showed  that  the  envelope  of  the 
time-modified  fricative  could  be  adjusted  to  create  cues  for  both  voiced  and  unvoiced  stops. 

The  spectral  characteristics  of  the  portion  of  the  consonant  that  was  preserved 
played  several  roles.  Not  surprisingly,  the  perceived  place  of  articulation  was  strongly 
dependent  upon  the  frequency  content  of  the  time-modified  consonant.  When  the 
frequency  content  was  modified  to  contain  spectral  features  largely  different  than  the 
original  sound,  the  perceived  place  of  articulation  shifted.  An  example  of  this  is  the 
perception  of  both  /b/  and  /d/  for  the  tokens  that  preserved  the  end  of  the  /s/  in  the  word 
"sue."  While  all  of  the  tokens  were  derived  from  the  alveolar  /s/,  both  a  labial  /b/  and  an 
alveolar  /d/  were  perceived  at  different  durations,  primarily  due  to  the  center  frequency  of 
the  portion  of  the  frication  that  was  preserved  in  the  tokens. 

As discussed earlier, the spectral content of the fricatives changed
unexpectedly over time in the four original, unmodified words. The amount of this
spectral  movement  was  clearly  greater  than  would  normally  be  expected  due  to 
coarticulation,  and  it  emphasized  the  variability  that  exists  in  the  production  of  English 
phonemes. 

However,  this  spectral  variability  also  helped  to  demonstrate  something  else:  It 
showed  that  perception  of  phonemes  was  apparently  accomplished  using  an  assortment  of 
acoustic cues. In addition, the experiments showed that for some tokens, cues for two
conflicting sounds were present, and that the resulting phoneme that was perceived
depended  upon  the  relative  strength  of  the  conflicting  cues. 

As  an  example,  consider  the  tokens  that  were  created  by  preserving  the  final  10  to 
30 ms of the /z/ in the word "zed." Cues for a stop were created by the abrupt onset of the
time-modified consonant. Additional acoustic features present included voicing, and a
consonant duration of less than 35 ms. These features strengthened the cue for a voiced
stop. In contrast, there were also cues present for a sustained sound due to the large amount
of energy present at the first three formant frequencies. The cues for the sustained sound
were enhanced by the fact that the formant frequencies did not vary appreciably during the
original /z/, nor during the transition to the following vowel. Since the listeners selected
either /z/ or /θ/ for these tokens, the cues for the stop were evidently overshadowed by
the  cues  for  the  sustained  sound. 

Despite  the  existence  of  conflicting  cues,  in  general,  the  results  showed  that  the 
phoneme  category  that  was  perceived  was  usually  attributable  to  the  dominance  of  a  single 
cue.  It  is  interesting  to  observe,  however,  that  as  the  consonant  duration  was  increased, 
these  dominating  cues  tended  to  appear  and  disappear  quite  suddenly,  depending  upon  the 
specific  acoustic  composition  of  the  time-modified  signal.  This  result  supported  the 
popular  theory  that  humans  use  "categorical  perception"  in  processing  speech  (Liberman 
et  al.,  1957).  Although  this  concept  is  beyond  the  scope  of  this  study,  the  theory  is  based 
upon  the  idea  that  as  the  value  of  an  acoustic  feature  is  changed  over  a  continuum,  there 
are  specific  ranges  of  values  over  which  humans  are  unable  to  detect  a  perceptual  change. 
If  the  value  of  the  acoustic  feature  is  varied  across  the  boundary  between  two  of  these 
ranges,  there  is  a  distinct  change  in  perception.  However,  as  long  as  the  value  of  the 
acoustic  feature  is  kept  within  a  single  range,  perception  remains  relatively  constant. 


CHAPTER  6 
SUMMARY  AND  CONCLUSIONS 


This  chapter  presents  a  brief  summary  of  both  the  time  modification  system  and  the 
listening  tests.  Recommendations  for  further  research  are  then  presented. 

6.1  Summary 

This study accomplished two main tasks: The first was the development of a new
software system that is capable of producing high quality, time-modified speech tokens.
The system was developed to aid researchers in the creation of speech stimuli for use in
studies of human perception. The second task that was accomplished was the verification
of the system's performance. Both informal listening tests (pilot studies) and a formal
listening  test  were  conducted.  For  the  formal  listening  test,  a  total  of  270  time-modified 
speech  tokens  were  created  by  preserving  different  portions  and  durations  of  the  initial 
consonant  in  the  words  "sue,"  "zoo,"  "said,"  and  "zed."  The  tokens  were  used  to  measure 
the  effects  of  both  duration  and  the  portion  of  the  phoneme  that  was  preserved  upon  the 
perception  of  the  initial  consonant.  The  time  modification  system  and  the  listening  tests 
are  summarized  in  the  following  two  sections. 

6.1.1 The Time Modification System

The  time  modification  system  developed  in  this  study  provides  the  speech 
researcher  with  a  convenient  and  easy-to-use  tool  that  offers  a  high  degree  of  precision  and 
flexibility  without  cumbersome  waveform  editors  or  complicated  command-line  syntax. 
The system is controlled by a graphical user interface (GUI) that is composed of a series
of windows that guide the user in a logical manner through the time modification process.
All of the user-adjustable parameters are modified by using the workstation mouse to select
and  adjust  sliders  and  push-buttons  that  are  displayed  in  the  various  windows.  The  system 
is  written  entirely  in  the  MATLAB  programming  language.  As  a  result,  the  software  is 
easily  portable  to  other  hardware  platforms  and  operating  systems. 

The  time  modification  system  consists  of  three  stages.  The  first  stage,  also  known 
as the Segmentation and Labeling (S&L) stage, divides the signal into phoneme-type
segments and then labels each segment as either vowel, nasal, semivowel, voiced fricative,
voice bar, unvoiced fricative, unvoiced stop, or silent. To accomplish this, the signal is first
divided pitch-synchronously into frames. An LPC analysis is performed on each frame, and
the residue is saved for later use in the synthesis process. The LPC coefficients are analyzed
on a frame-by-frame basis by a series of "feature detector" programs that determine the
presence (or absence) of numerous acoustic features. The feature detectors operate by
examining the spectral distribution of the signal. The signal is then parsed into
multiple-frame-length segments of unknown phonemic type based upon both the V/U/S
classification and the short-term changes in the frequency spectra. Each segment is labeled
with one of the eight segment types, i.e., vowel, nasal, etc. This is accomplished by
comparing the average level of each of the feature detector outputs for the segment.
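The per-frame LPC analysis and the saved residue can be illustrated with a minimal sketch. It is not the system's MATLAB code: it assumes the autocorrelation method with the Levinson-Durbin recursion, a single fixed-length synthetic frame rather than pitch-synchronous frames, and hypothetical function names. It shows that when the original residue is used as the excitation for the all-pole synthesis filter, the frame is recovered to within rounding error.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Estimate LPC coefficients a = [1, a1, ..., ap] with the
    autocorrelation method (Levinson-Durbin recursion)."""
    x = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent frame: nothing left to predict
            break
        # Reflection coefficient for model order i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_residue(frame, a):
    """Inverse-filter the frame: e[n] = sum_k a[k] * x[n - k]."""
    x = np.asarray(frame, dtype=float)
    return np.convolve(x, a)[: len(x)]


def lpc_synthesize(residue, a):
    """All-pole synthesis: x[n] = e[n] - sum_{k>=1} a[k] * x[n - k]."""
    p = len(a) - 1
    out = np.zeros(len(residue))
    for n in range(len(residue)):
        acc = residue[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * out[n - k]
        out[n] = acc
    return out


# Synthetic "voiced" frame: two damped sinusoids plus a little noise.
rng = np.random.default_rng(0)
n = np.arange(240)
frame = (np.exp(-n / 120.0) * (np.sin(0.2 * n) + 0.5 * np.sin(0.9 * n))
         + 0.01 * rng.standard_normal(n.size))

coeffs = lpc_autocorrelation(frame, order=10)
residue = lpc_residue(frame, coeffs)
rebuilt = lpc_synthesize(residue, coeffs)
print("max reconstruction error:", np.max(np.abs(rebuilt - frame)))
```

In a frame-based system the filter state would also have to be carried across frame boundaries; that bookkeeping is omitted here.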

The  second  stage  allows  the  user  to  manually  edit  the  results  produced  by  the 
automatic S&L algorithms. Note that this editing process is not strictly required; it is only
invoked  if  the  user  determines  that  the  automatic  S&L  results  are  in  error.  The  editing 
programs  are  controlled  by  a  GUI  that  allows  the  user  to  display  the  results  and  edit  either 
the  segment  labels  or  the  position  of  any  of  the  boundaries  between  the  segments.  The  user 
edits  the  S&L  parameters  by  selecting  (via  a  mouse)  push-buttons  and  sliders  that  are 
displayed  in  the  various  GUI  windows. 

The  third  stage  consists  of  the  time  modification  programs  as  well  as  the  LPC 
speech  synthesizer  that  synthesizes  the  resulting  time-modified  speech.  Both  the  time 
modification programs and the synthesizer are controlled by a GUI. The time modification
programs are extremely flexible, and allow the user to specify both the desired scale factor
(SF) and the minimum duration (MD) for each segment of the speech signal. A weighting
function, or "map," can also be specified for each segment. The map determines which
frames in the segment are removed or duplicated during the synthesis process. The user
can select one of five fixed maps, or can create and edit a customized map for precise
control.  All  of  the  user-specified  parameters  are  controlled  via  a  mouse  by  selecting  and 
adjusting  push-buttons  and  sliders  that  are  displayed  in  the  GUI  windows. 
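How a scale factor, a minimum duration, and a map could jointly decide which frames are kept or repeated can be illustrated with a rough sketch. The actual selection rule and the five fixed maps are not described in this section, so everything below is an assumption: the map is treated as a per-frame priority (higher weight means "preserve first"), the minimum duration is expressed as a frame count, and the function name `select_frames` is hypothetical.

```python
import numpy as np


def select_frames(weights, scale_factor, min_frames=1):
    """Return the frame indices to synthesize for one segment.

    weights      -- per-frame priorities (the "map"); assumed convention
    scale_factor -- target duration divided by original duration (SF)
    min_frames   -- minimum duration in frames (stands in for MD)
    """
    weights = np.asarray(weights, dtype=float)
    n_frames = weights.size
    target = int(round(n_frames * scale_factor))
    if scale_factor < 1.0:
        # MD acts as a floor when shortening, but never lengthens the segment.
        target = max(target, min(min_frames, n_frames))
    # Frame indices from highest to lowest weight; ties keep time order.
    priority = np.argsort(-weights, kind="stable")
    if target <= n_frames:
        # Shortening: keep the highest-priority frames, in time order.
        return np.sort(priority[:target]).tolist()
    # Lengthening: repeat frames, highest priority first.
    counts = np.ones(n_frames, dtype=int)
    for i in range(target - n_frames):
        counts[priority[i % n_frames]] += 1
    return np.repeat(np.arange(n_frames), counts).tolist()


# Hypothetical maps for a 10-frame consonant shortened to 30% of its length.
ramp = np.arange(10)
print("preserve end:      ", select_frames(ramp, 0.3))
print("preserve beginning:", select_frames(ramp[::-1], 0.3))
print("preserve middle:   ", select_frames([0, 1, 2, 3, 4, 4, 3, 2, 1, 0], 0.3))
```

Under these assumptions, a rising, falling, or peaked map corresponds to preserving the end, beginning, or middle of a segment, which mirrors the three token types used in the listening tests.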

6.1.2  The  Listening  Tests 

This  study  performed  both  informal  listening  tests  (pilot  studies)  and  a  formal 
listening  test.    They  are  summarized  in  the  following  paragraphs. 

The  pilot  studies  were  performed  by  modifying  the  duration  of  the  initial  consonant 
in  various  words  from  the  Diagnostic  Rhyme  Test  (DRT)  spoken  by  one  female  and  two 
male  speakers.  The  resulting  time-modified  tokens  were  presented  over  headphones  to 
three  listeners.  The  results  of  these  tests  led  to  several  conclusions:  The  first  was  that  the 
system  was  capable  of  creating  high  quality  speech  tokens.  The  listeners  reported  that  in 
all  instances,  the  time-modified  tokens  were  indistinguishable  from  the  original  speech, 
insofar  as  the  quality  was  concerned.  This  was  attributed  both  to  the  glitch  prevention 
algorithm, and to the use of the original LPC residue as the excitation signal during the
synthesis  process. 

The  second  conclusion  that  was  reached  in  the  pilot  studies  was  that  the  perception 
of  the  time-modified  initial  consonant  was  a  function  of  several  factors.  These  factors 
included  (1)  the  duration  of  the  time-modified  consonant,  (2)  the  "position,"  or  portion  of 
the  original  consonant  that  was  preserved,  and  (3)  the  category  of  speech  sound  that  was 
modified  (i.e.  stop  versus  fricative).  In  most  cases,  the  perception  of  the  identity  of  the 
initial  consonant  shifted  as  the  consonant  duration  was  decreased  from  100%  to  0%  of  its 
original value. However, the rate at which perception shifted was a function of the category
of phoneme that was modified. For example, both nasals and voiced fricatives remained
relatively  unaffected  as  their  duration  was  decreased,  while  voiced  and  unvoiced  stops 
were  affected  significantly.  In  addition,  for  a  given  duration,  the  portion  of  the  consonant 
that  was  preserved  had  an  effect  upon  perception. 

The  formal  listening  tests  were  conducted  using  270  tokens  that  were  created  by 
modifying  the  duration  of  the  initial  consonant  in  four  words  from  the  DRT.  The  four  words 
that  were  modified  were  "sue,"  "zoo,"  "said,"  and  "zed."  The  duration  of  the  initial 
consonant  was  increased  in  10  ms  increments  from  0%  to  100%  of  the  original  consonant 
duration.  For  each  word,  and  for  each  duration,  three  tokens  were  created:  The  first 
preserved  the  beginning  of  the  initial  consonant,  the  second  preserved  the  middle  of  the 
initial consonant, and the third preserved the end of the initial consonant.

The  listening  tests  were  automated,  and  administered  using  a  Sun  IPX  UNIX 
workstation.  The  270  test  tokens  were  presented  over  Sony  MDR-CD888  headphones  at 
a  comfortable  listening  level,  at  a  rate  of  approximately  one  token  every  4.5  seconds. 
Approximately  one  second  before  each  token  was  presented,  a  list  of  nine  word  choices  was 
displayed  on  the  computer's  CRT  display.  A  push-button  was  also  displayed  below  each 
of  the  nine  choices.  The  listener  was  instructed  to  select  (via  the  mouse)  the  push-button 
below  the  word  that  most  closely  matched  the  token  that  was  heard. 

The  set  of  possible  answers  for  each  of  the  two  word  pairs  ("sue/zoo"  and 
"said/zed")  contained  words  with  the  same  list  of  initial  consonants.  Each  answer  differed 
from  the  original,  unmodified  word  only  in  the  initial  phoneme.  The  set  of  possible 
answers  for  the  words  "sue"  and  "zoo"  were:  "boo,"  "doo,"  "lou,"  "ooh,"  "poo,"  "sue," 
"two,"  "thoo,"  and  "zoo."  Likewise,  the  set  of  possible  answers  for  the  words  "said"  and 
"zed"  were:  "bed,"  "dead,"  "ed,"  "led,"  "ped,"  "said,"  "ted,"  "thed,"  and  "zed."  Note  that 
all  of  the  words  in  a  given  list  rhymed  with  one  another. 

The  listeners  were  both  students  and  faculty  from  the  Department  of 
Communication Processes and Disorders at the University of Florida. Each listener was
paid $15. A total of 29 listeners (24 female and 5 male) took the test. Each listener took
the test twice, once on one day and once on a second day. This was done to screen for
possible  intra-listener  variability,  or  inconsistency.  Two  of  the  listeners  were  rejected  by 
this  screening  process.  Therefore,  the  results  were  presented  for  a  total  of  27  listeners  (23 
female  and  4  male). 

The  results  led  to  several  conclusions.  The  first  was  that  the  duration  of  the 
time-modified  consonant  influenced  the  perception  of  the  identity  of  the  consonant.  In 
general, as the duration of the fricative was increased from 0% to 100% of its original value,
a  series  of  initial  consonants  was  perceived.  At  relatively  short  durations  (typically  less 
than  50  ms),  the  initial  consonant  was  often  perceived  as  either  a  weak  fricative  or  a  voiced 
stop.  For  slightly  longer  durations,  the  initial  consonant  was  perceived  in  many  instances 
as  either  an  unvoiced  stop  or  a  liquid.  As  the  duration  was  increased  further,  for  each  of 
the  four  original  words,  the  perception  of  the  identity  of  the  initial  consonant  shifted  to  that 
of  the  original,  unmodified  consonant.  The  "threshold,"  or  the  minimum  duration  required 
for  consistent  and  accurate  perception  of  the  original  consonant  in  each  word  was  also 
examined.  An  interesting  result  was  that  this  threshold  was  observed  to  be  dependent  upon 
the  duration  of  the  time-modified  consonant  as  well  as  the  unmodified  duration  of  the 
following  vowel. 

A  second  conclusion  that  resulted  from  the  formal  listening  test  was  that  the 
"position,"  or  portion  of  the  initial  consonant  that  was  preserved  often  affected  the 
perception  of  the  consonant.  The  perceived  phoneme  category  was  most  affected  by 
position.  It  was  shown  that  artificial  cues  for  the  presence  of  a  stop  consonant  were  created 
when  the  middle  or  end  portion  of  the  initial  (fricative)  consonant  was  preserved.  These 
cues  were  attributed  to  the  abrupt  onset  of  the  time-modified  consonant,  which  closely 
resembled  the  release  of  stored  energy  normally  exhibited  by  a  naturally-produced  stop. 
Conversely, cues for a stop were not created when the beginning portion of the original
consonant was preserved. This was because the energy contour was increasing for this
portion  of  the  fricative,  instead  of  decreasing  as  it  would  during  the  release  in  a  stop. 

In  certain  instances,  the  perceived  place  of  articulation  was  also  affected  by  the 
portion  of  the  original  consonant  that  was  preserved.  When  this  effect  was  examined 
closely,  it  was  determined  that  this  result  was  caused  by  variations  in  the  spectral  content 
of  the  original  fricatives  as  a  function  of  time.  For  example,  in  the  original  word  "zoo," 
the  frication  noise  was  largely  absent  from  the  beginning  and  middle  of  the  /z/,  and  was 
concentrated primarily at the end of the /z/. Thus, the tokens that were created with either
the beginning or the middle portion of the unmodified /z/ were usually not perceived as
fricatives, due to the lack of frication. In other examples, the perceived place of articulation
of a voiced stop was directly attributable to the center frequency of the simulated "noise
burst"  that  was  heard.  Note,  however,  that  while  position  was  shown  to  affect  the  perceived 
place  of  articulation,  the  results  did  not  prove  that  this  effect  was  independent  of  the  spectral 
content  of  the  signal. 

Another  interesting  result  that  was  observed  was  that  there  were  definite  and 
sudden  shifts  in  the  perception  of  the  identity  of  the  initial  consonant  as  the  duration  was 
changed  over  a  relatively  small  range  of  values.  This  result  supported  the  theory  that 
humans  use  "categorical  perception"  in  processing  speech  (Liberman  et  al.,  1957). 
Although  this  concept  is  beyond  the  scope  of  this  study,  the  theory  is  based  upon  the  idea 
that  as  the  value  of  an  acoustic  feature  is  changed  over  a  continuum,  there  are  specific 
ranges  of  values  over  which  humans  are  unable  to  detect  a  perceptual  change.  If  the  value 
of the acoustic feature is varied across the boundary between two of these ranges, there is
a distinct change in perception. However, as long as the value of the acoustic feature is kept
within a single range, perception remains relatively constant.

6.2 Recommendations for Further Work

This section outlines recommendations for further work involving the time
modification  system.  The  recommendations  are  grouped  according  to  (1)  additional 
listening  tests  and  (2)  enhancements  to  the  time  modification  software.  They  are  discussed 
separately  in  the  following  sections. 

6.2.1  Additional  Listening  Tests 

One  of  the  main  goals  of  this  study  was  to  develop  a  system  that  allows  quick  and 
easy  development  of  time-modified  speech  tokens  for  use  in  studies  of  human  perception. 
Therefore,  it  is  an  obvious  suggestion  that  additional  listening  tests  should  be  conducted. 

However,  there  are  specific  listening  tests  that  could  be  conducted  to  further 
investigate  the  conclusions  that  were  reached  in  this  study.  They  are  now  discussed.  One 
of  the  most  important  conclusions  was  that  the  threshold  for  consistent  (and  accurate) 
perception  of  the  initial  consonant  in  single-syllable  CV  and  CVC  words  was  dependent 
upon  the  durations  of  both  the  consonant  and  the  following  vowel.  This  could  be  further 
investigated  by  using  the  time  modification  system  to  create  a  series  of  CV  and  CVC 
syllables  with  varying  consonant  and  vowel  durations.  Unlike  naturally-produced  tokens, 
the  durations  of  the  consonant  and  vowel  segments  could  be  precisely  controlled,  thus 
allowing  tokens  with  a  wide  variety  of  durations  to  be  created  and  tested. 

Another  test  that  could  be  performed  would  involve  further  testing  of  the 
conclusion that the liquid /l/ was perceived instead of the fricative /z/ due to the lack of
frication noise. This would require collection and modification of numerous natural speech
tokens with /z/ in the word-initial position. If natural tokens that exhibited a relatively
uniform distribution of frication over time were collected and modified, the conclusions
of  the  present  study  could  be  tested  further. 




The  last  test  that  is  suggested  is  to  investigate  the  conclusion  that  the  perception  of 
stops  versus  weak  fricatives  was  a  result  of  the  different  RMS  power  levels  of  the 
unmodified,  initial  consonants.  This  test  could  be  accomplished  by  modifying  the  duration 
of  the  initial  consonant  in  both  weak  and  strong  unvoiced  fricatives. 
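To compare candidate tokens before running such a test, the RMS level of each unmodified initial consonant could be measured directly from its samples. The sketch below is one way to do this; the function names and the synthetic sample values are assumptions for illustration and are not part of the existing software.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("segment contains no samples")
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_db(samples, ref=1.0):
    """RMS level in dB relative to `ref`; returns -inf for pure silence."""
    level = rms(samples)
    return float("-inf") if level == 0 else 20.0 * math.log10(level / ref)


# Synthetic stand-ins for a weak and a strong fricative segment.
weak = [0.01, -0.01] * 50
strong = [0.1, -0.1] * 50
print(round(rms_db(strong) - rms_db(weak), 2), "dB difference")
```

In practice the sample lists would come from the consonant segments identified during segmentation and labeling.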

6.2.2  Enhancements  to  the  Time  Modification  System 

The time modification system currently meets the original design specifications.
It  accomplishes  time  modification  of  speech  in  a  convenient,  easy-to-use,  flexible  manner, 
without  the  use  of  graphics-based  waveform  editors.  It  also  allows  the  user  to  modify  the 
signal  based  upon  parameters  that  are  familiar  to  the  speech  researcher.  And  most 
important,  the  resulting  synthesized  speech  is  virtually  indistinguishable  from  the  original 
speech,  insofar  as  quality  or  "naturalness"  is  concerned. 

A  logical  extension  to  the  time  modification  system  is  to  expand  the  number  of 
features  associated  with  the  speech  signal  that  can  be  modified.  While  there  are  numerous 
features  that  could  be  modified,  the  researchers  are  most  interested  in  expanding  the 
capabilities  of  the  system  to  control  the  features  typically  associated  with  prosody,  i.e.  the 
rhythm  and  tonal  patterns  of  the  speech  signal  (Borden  and  Harris,  1984).  In  general, 
prosody  is  manifested  by  changes  in  (1)  the  fundamental  frequency,  or  pitch,  (2)  the 
loudness,  and  (3)  the  durations  of  the  individual  segments  that  comprise  the  speech  signal. 
Note  that  the  system  is  already  capable  of  modifying  the  segment  durations. 

In  order  to  control  the  loudness  of  the  individual  speech  segments  in  the  speech 
signal,  the  segments  must  first  be  identified  and  then  either  amplified  or  attenuated.  Since 
the software currently identifies the speech segments during the segmentation and labeling
stage, the only remaining task is to adjust the gain of the individual segments. This could
be accomplished by assigning a "gain" variable to each segment. Note that to avoid
discontinuities, or "glitches," at the boundary between segments, the gain would have to
be automatically interpolated by the LPC speech synthesizer. The gain could also be
assigned on a global basis, much in the same way that the scale factor (SF) and minimum
duration (MD) parameters are assigned. Thus, the user could increase the gain of all
occurrences of a particular segment type (e.g., nasals) with one parameter. The inclusion
of  eight  gain  factors,  one  for  each  segment  type,  would  require  only  one  additional  window 
and  relatively  minor  modification  to  the  LPC  synthesizer  software. 
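The idea of a per-segment gain that is interpolated across segment boundaries can be sketched as follows. This is only an illustration under stated assumptions: it builds a sample-level gain contour with a linear ramp centered on each boundary, whereas the text proposes that the interpolation be done inside the LPC synthesizer. The function name, the `ramp` parameter, and the segment labels are hypothetical.

```python
def gain_contour(segments, gains, ramp=4):
    """Build one gain value per sample.

    segments: list of (label, length_in_samples)
    gains:    dict mapping a segment label to a linear gain (default 1.0)
    ramp:     number of samples over which the gain is linearly
              interpolated around each segment boundary
    """
    contour = []
    for label, length in segments:
        contour.extend([gains.get(label, 1.0)] * length)

    pos = 0
    for (label_a, len_a), (label_b, _) in zip(segments, segments[1:]):
        pos += len_a                               # first sample of next segment
        g_a, g_b = gains.get(label_a, 1.0), gains.get(label_b, 1.0)
        start = max(pos - ramp // 2, 0)
        stop = min(start + ramp, len(contour))
        n = stop - start
        for i in range(n):
            t = (i + 1) / (n + 1)                  # strictly between 0 and 1
            contour[start + i] = g_a + (g_b - g_a) * t
    return contour


# Hypothetical example: boost a nasal segment that precedes a vowel.
contour = gain_contour([("nasal", 6), ("vowel", 6)],
                       {"nasal": 2.0, "vowel": 1.0}, ramp=4)
print([round(g, 2) for g in contour])
```

The sketch does not handle segments shorter than the ramp, where adjacent ramps would overlap; a real implementation would need to decide how to treat that case.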

Modification of the fundamental frequency contour, on the other hand, is more
complicated than modification of the gain. The reason for this is that the pitch period would
have to be either lengthened (or shortened) by a varying amount for each voiced frame.
Thus,  the  excitation  signal  would  also  have  to  be  lengthened  (or  shortened).  It  appears  that 
shortening  would  be  relatively  easy — one  possible  method  would  be  to  simply  truncate  the 
excitation.  However,  lengthening  is  more  complicated.  This  is  because  the  system  would 
be  forced  to  extend  the  residue/excitation  signal  for  each  frame.  This  would  involve 
creating  a  noise-like  sequence  that  closely  resembles  the  final  portion  of  the  residue  from 
each  LPC  analysis  frame  without  creating  audible  artifacts. 
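The two cases described above can be sketched for a single pitch period of excitation. Shortening is done by truncation, as suggested in the text. For lengthening, the sketch appends Gaussian noise whose level matches the RMS of the final few residue samples; this particular choice of noise, the `tail` length, and the function name are assumptions made for illustration, and whether the result is free of audible artifacts has not been tested.

```python
import math
import random


def resize_period(residual, new_len, tail=8, seed=0):
    """Return the excitation for one pitch period, resized to `new_len` samples."""
    if not residual:
        raise ValueError("residual is empty")
    if new_len <= len(residual):
        return list(residual[:new_len])            # shorten: truncate

    # Lengthen: pad with noise matched in level to the end of the residue.
    tail_part = residual[-tail:]
    level = math.sqrt(sum(x * x for x in tail_part) / len(tail_part))
    rng = random.Random(seed)                      # seeded for repeatability
    pad = [rng.gauss(0.0, level) for _ in range(new_len - len(residual))]
    return list(residual) + pad


# Hypothetical residue for one pitch period.
period = [1.0, 0.5, 0.2, 0.1, -0.1, 0.05, -0.05, 0.02]
shorter = resize_period(period, 5)
longer = resize_period(period, 12)
print(len(shorter), len(longer))
```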

Modification  of  the  pitch  contour  is  further  complicated  by  the  fact  that  the  number 
of frames in each segment would change, provided that the segment length was kept
constant  as  the  pitch  was  modified.  Inevitably,  this  would  require  interpolation  of  the  LPC 
filter  parameters  for  some  or  all  of  the  frames,  a  process  that  could  easily  introduce 
significant  distortion  in  the  synthesized  speech  waveform. 

However,  one  solution  that  could  possibly  accomplish  pitch  change  without 
requiring  interpolation  of  the  LPC  filter  coefficients  is  as  follows:  The  first  step  would 
modify  the  pitch  contour  without  changing  the  number  of  frames  in  a  given  segment.  This 
would  result  in  a  segment  that  was  either  longer  or  shorter  than  the  original,  depending  on 
whether  the  average  pitch  period  was  either  increased  or  decreased,  respectively.  Note  that 
while  this  step  would  require  modification  of  the  residue  signal  (as  previously  discussed), 
it would not involve interpolation of the filter coefficients. The second step in the process
would utilize the existing time modification software to adjust the duration of the
previously lengthened (or shortened) segment to be equal to the original segment length.
This  would  involve  discarding  (or  repeating)  select  frames  from  the  segment.  If  the  frames 
were  selected  at  equally-spaced  intervals,  it  is  possible  that  the  resulting  speech  would 
contain  no  noticeable  distortion. 
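The frame selection in the second step can be sketched as an index mapping. Assuming the required number of output frames is already known, the function below chooses which input frames to keep (or repeat) at equally spaced positions. The function name is hypothetical, and this is not a description of how the existing time modification software selects frames.

```python
def select_frames(n_frames, target):
    """Map `target` output frames onto `n_frames` input frames.

    Returns a list of input-frame indices of length `target`. When
    target < n_frames, equally spaced frames are dropped; when
    target > n_frames, equally spaced frames are repeated.
    """
    if n_frames <= 0 or target <= 0:
        raise ValueError("frame counts must be positive")
    # Integer form of floor((i + 0.5) * n_frames / target).
    return [((2 * i + 1) * n_frames) // (2 * target) for i in range(target)]


print(select_frames(10, 8))   # shorten a 10-frame segment to 8 frames
print(select_frames(4, 6))    # lengthen a 4-frame segment to 6 frames
```

Because the selection depends only on frame indices, the LPC filter coefficients of each retained frame are used unchanged, which is the property the two-step procedure is meant to preserve.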


APPENDIX A
THE DIAGNOSTIC RHYME TEST WORD LIST


Voicing

    [veal]/[feel]    bean/[peen]      gin/[chin]       dint/[tint]
    zoo/sue          dune/tune        voal/foal        goat/coat
    zed/said         dense/tense      vast/fast        gaff/calf
    vault/fault      daunt/taunt      jock/chock       bond/pond

Nasality

    meat/beat        need/deed        mitt/bit         nip/dip
    moot/boot        news/dues        moan/bone        note/dote
    mend/bend        neck/deck        mad/bad          nab/dab
    moss/boss        gnaw/daw         mom/bomb         knock/dock

Sibilation

    zee/thee         cheep/keep       jilt/guilt       sing/thing
    juice/goose      chew/coo         Joe/go           sole/thole
    jest/guest       chair/care       jab/dab          sank/thank
    jaws/gauze       saw/thaw         jot/got          chop/cop

Graveness

    weed/reed        peak/teak        bid/did          fin/thin
    moon/noon        pool/tool        bowl/dole        fore/thor
    met/net          pent/tent        bank/dank        fad/thad
    fought/thought   bong/dong        wad/rod          pot/tot

Sustention

    vee/bee          sheet/cheat      vill/bill        thick/tick
    foo/pooh         shoes/choose     those/doze       though/dough
    then/den         fence/pence      than/Dan         shad/chad
    thong/tong       shaw/chaw        von/bon          vox/box

Compactness

    yield/wield      key/tea          hit/fit          gill/dill
    coop/poop        you/rue          ghost/boast      show/so
    keg/peg          yen/wren         gat/bat          shag/sad
    yawl/wall        caught/taught    hop/fop          got/dot


APPENDIX B
LISTENING  TEST  INSTRUCTIONS 


This  is  a  listening  test.  You  will  use  headphones  to  listen  to  a  series  of 
single-syllable  test  words.  Your  task  is  to  use  the  computer  mouse  to  select  one  of  the  nine 
words  displayed  on  the  computer  screen  that  best  matches  each  token  that  you  hear. 

The  nine  words,  or  choices,  are  displayed  visually  on  the  computer  screen 
approximately  one  second  before  the  token  is  played.  In  addition,  there  are  push-buttons 
displayed  below  the  words.  To  select  one  of  the  nine  words,  use  the  mouse  to  position  the 
cursor  over  the  push-button  below  the  word  you  have  selected,  and  then  click  the  left  mouse 
button.  Once  a  word  is  selected,  all  of  the  other  words  that  are  displayed  will  turn  light  gray. 
The word that you have selected will remain white.

As  the  computer  system  prepares  to  play  the  next  test  word,  it  re-displays  the  list 
of  nine  word  choices,  and  resets  the  color  of  all  of  the  displayed  words  to  white.  This 
indicates  that  the  next  word  token  will  be  played  in  approximately  one  second. 

The  test  words  are  played  at  a  rate  of  approximately  one  word  every  4.5  seconds. 
There  is  a  five  second  pause  after  every  ten  test  words.  There  is  also  a  five  minute  break 
after every 90 test words. When these pauses and breaks occur, instructions will appear
in  the  test  display.  The  test  will  not  stop,  other  than  for  these  planned  pauses  and  breaks. 

You  must  select  only  one  of  the  nine  choices  for  each  test  word  that  you  hear.  If 
the  test  word  sounds  similar  to  two  (or  more)  of  the  choices  that  are  displayed,  select  the 
choice  that  most  closely  matches  the  test  word.  Note  that  while  you  will  not  be  penalized 
if  you  fail  to  make  a  choice  in  the  allotted  time,  it  is  extremely  important  that  you  pick  an 
answer,  even  if  you  think  that  your  choice  is  not  a  "good"  match.  Also  note  that  if  you 
change your mind before time has run out for a given test word, you can change your answer
by  selecting  your  new  choice  with  the  mouse. 

In  the  test,  there  are  two  separate  sets  of  nine  word  choices  that  are  displayed.  The 
first  set  is  [  bed,  dead,  led,  ed,  ped,  said,  ted,  thed,  zed  ],  and  the  second  set  is  [  boo,  doo, 
lou,  ooh,  poo,  sue,  two,  thoo,  zoo  ].  These  sets  alternate  back  and  forth  at  random.  Note 
that,  for  the  most  part,  the  words  in  these  sets  are  arranged  and  displayed  alphabetically. 
This  fact  may  be  used  to  help  you  to  quickly  find  your  choice  once  a  test  word  has  been 
played.  For  example,  if  you  hear  the  word  "zoo,"  you  can  accelerate  your  search  by 
concentrating  on  the  right  side  of  the  computer  display  while  searching  for  the  best  match, 
since  "z"  is  the  last  letter  of  the  alphabet. 

You  are  encouraged  to  practice  taking  the  test  (for  as  long  as  you  like)  to  get 
accustomed  to  the  test  words  and  the  motor  demands  of  the  testing  system.  If  you  have  any 
questions, feel free to ask.


APPENDIX  C 
FORMAL  LISTENING  TEST  RESULTS 


The  numerical  results  for  the  formal  listening  tests  are  listed  in  the  following  twelve 
tables. The format is as follows: The data in each table are arranged in ten columns. The
first column lists the duration of the time-modified consonant (in ms). The remaining
columns list, for each duration, the percentage of listeners that chose the word listed in the
column heading. For example, in the first table, Table C-1, 50.00% of the listeners heard
the  word  "boo"  when  listening  to  the  token  that  preserved  the  first  10  ms  of  the  /s/  in  the 
word  "sue." 
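A row of these tables can be handled programmatically as shown in the following minimal sketch, which uses the 10 ms row of Table C-1. It checks that the nine percentages sum to 100 within rounding error and reports the most frequently chosen word. The function name and the tolerance value are assumptions introduced for this example.

```python
CHOICES = ["boo", "doo", "lou", "ooh", "poo", "sue", "two", "thoo", "zoo"]

# 10 ms row of Table C-1 (percent of listeners choosing each word).
row_10ms = [50.00, 1.85, 1.85, 42.59, 0.00, 0.00, 0.00, 3.70, 0.00]


def most_frequent(choices, percentages, tol=0.05):
    """Return (word, percent) for the most frequently chosen word.

    Raises ValueError if the percentages do not sum to 100 within `tol`,
    which allows for rounding of each entry to two decimal places.
    """
    total = sum(percentages)
    if abs(total - 100.0) > tol:
        raise ValueError("row sums to %.2f, expected 100" % total)
    best = max(range(len(choices)), key=lambda i: percentages[i])
    return choices[best], percentages[best]


print(most_frequent(CHOICES, row_10ms))
```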

Table C-1. Results for the tokens that preserved the beginning
portions of the /s/ in the word "sue."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0   35.19    3.70    1.85   55.56    0.00    0.00    0.00    3.70    0.00
  10   50.00    1.85    1.85   42.59    0.00    0.00    0.00    3.70    0.00
  20   35.19    1.85    1.85   50.00    3.70    0.00    0.00    7.41    0.00
  30    7.41    9.26    1.85   62.96    1.85    1.85    0.00   12.96    1.85
  40    0.00   11.11    7.41   38.89    0.00    0.00    0.00   35.19    7.41
  50    3.70    3.70    3.70   25.93    0.00    0.00    0.00   44.44   18.52
  60    0.00   14.81    0.00   22.22    0.00    1.85    0.00   22.22   38.89
  70    0.00    5.56    0.00   11.11    0.00    3.70    1.85    3.70   74.07
  80    0.00    1.85    0.00    7.41    0.00   14.81    3.70    1.85   70.37
  90    0.00    1.85    0.00    3.70    1.85   16.67    0.00    1.85   74.07
 100    0.00    0.00    0.00    3.70    0.00   33.33    0.00    0.00   62.96
 110    0.00    1.85    0.00    0.00    0.00   53.70    0.00    0.00   44.44
 120    0.00    0.00    0.00    1.85    0.00   61.11    1.85    1.85   33.33
 130    0.00    0.00    0.00    0.00    0.00   72.22    0.00    0.00   27.78
 140    0.00    0.00    0.00    0.00    0.00   85.19    0.00    0.00   14.81
 150    0.00    0.00    0.00    0.00    0.00   79.63    0.00    0.00   20.37
 160    0.00    0.00    0.00    0.00    0.00   83.33    0.00    1.85   14.81
 170    0.00    0.00    0.00    0.00    0.00   94.44    0.00    0.00    5.56
 180    0.00    0.00    0.00    0.00    0.00   94.44    0.00    1.85    3.70
 190    0.00    0.00    0.00    0.00    0.00   96.30    0.00    0.00    3.70
 200    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 210    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 220    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 230    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 240    0.00    0.00    0.00    0.00    0.00   98.15    0.00    1.85    0.00


Table C-2. Results for the tokens that preserved the middle portions
of the /s/ in the word "sue."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0   27.78    1.85    0.00   62.96    0.00    1.85    0.00    1.85    3.70
  10    0.00   87.04    0.00    5.56    0.00    0.00    0.00    7.41    0.00
  20    0.00   87.04    0.00    5.56    1.85    0.00    3.70    1.85    0.00
  30    0.00   74.07    0.00    1.85    0.00    0.00   12.96    0.00   11.11
  40    0.00   48.15    0.00    3.70    0.00    0.00   31.48    5.56   11.11
  50    0.00   31.48    0.00    3.70    0.00    3.70   46.30    7.41    7.41
  60    0.00   18.52    0.00    1.85    0.00    0.00   48.15    5.56   25.93
  70    0.00    7.41    1.85    3.70    0.00    9.26   48.15    0.00   29.63
  80    0.00    5.56    0.00    1.85    0.00   14.81   40.74    3.70   33.33
  90    0.00    9.26    0.00    1.85    0.00    7.41   57.41    3.70   20.37
 100    0.00    1.85    0.00    5.56    0.00   59.26   14.81    0.00   18.52
 110    0.00    3.70    0.00    1.85    0.00   53.70   24.07    1.85   14.81
 120    0.00    0.00    0.00    0.00    0.00   75.93   11.11    0.00   12.96
 130    0.00    0.00    0.00    0.00    0.00   85.19   11.11    1.85    1.85
 140    0.00    0.00    0.00    0.00    0.00   85.19    5.56    0.00    9.26
 150    0.00    0.00    0.00    0.00    0.00   92.59    5.56    1.85    0.00
 160    0.00    0.00    0.00    0.00    0.00   83.33    7.41    0.00    9.26
 170    0.00    0.00    0.00    0.00    0.00   90.74    5.56    1.85    1.85
 180    0.00    0.00    0.00    0.00    0.00   92.59    3.70    0.00    3.70
 190    0.00    0.00    0.00    0.00    0.00   96.30    3.70    0.00    0.00
 200    0.00    0.00    0.00    0.00    0.00   96.30    1.85    0.00    1.85
 210    0.00    0.00    0.00    0.00    0.00   96.30    3.70    0.00    0.00
 220    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 230    0.00    0.00    0.00    0.00    0.00   98.15    0.00    0.00    1.85
 240    0.00    0.00    0.00    0.00    0.00   98.15    0.00    0.00    1.85


Table C-3. Results for the tokens that preserved the end portions of
the /s/ in the word "sue."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0   31.48    5.56    1.85   57.41    0.00    0.00    0.00    3.70    0.00
  10   46.30    1.85    3.70   40.74    0.00    1.85    0.00    5.56    0.00
  20   40.74   40.74    0.00   12.96    1.85    0.00    0.00    3.70    0.00
  30    5.56   64.81    1.85    1.85    1.85    0.00   12.96    9.26    1.85
  40    0.00   61.11    0.00    0.00    0.00    0.00   37.04    1.85    0.00
  50    0.00   24.07    0.00    0.00    0.00    0.00   74.07    1.85    0.00
  60    0.00   18.52    0.00    0.00    0.00    0.00   79.63    1.85    0.00
  70    0.00    1.85    0.00    0.00    0.00    0.00   94.44    1.85    1.85
  80    0.00    0.00    0.00    0.00    0.00    0.00   94.44    1.85    3.70
  90    0.00    1.85    0.00    0.00    0.00    0.00   96.30    0.00    1.85
 100    0.00    0.00    0.00    0.00    0.00    0.00   94.44    0.00    5.56
 110    0.00    0.00    0.00    0.00    0.00   16.67   79.63    0.00    3.70
 120    0.00    0.00    0.00    0.00    0.00   16.67   81.48    0.00    1.85
 130    0.00    0.00    0.00    0.00    0.00   33.33   59.26    0.00    7.41
 140    0.00    0.00    0.00    1.85    0.00   55.56   38.89    0.00    3.70
 150    0.00    0.00    0.00    0.00    0.00   55.56   44.44    0.00    0.00
 160    0.00    0.00    0.00    0.00    0.00   79.63   16.67    3.70    0.00
 170    0.00    1.85    0.00    0.00    0.00   88.89    7.41    1.85    0.00
 180    0.00    0.00    0.00    0.00    0.00   87.04   12.96    0.00    0.00
 190    0.00    0.00    0.00    0.00    0.00   94.44    3.70    0.00    1.85
 200    0.00    0.00    0.00    0.00    0.00   92.59    3.70    1.85    1.85
 210    0.00    0.00    0.00    0.00    0.00   88.89    9.26    1.85    0.00
 220    0.00    0.00    0.00    0.00    0.00   94.44    3.70    1.85    0.00
 230    0.00    0.00    0.00    0.00    0.00   96.30    1.85    0.00    1.85
 240    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00



Table C-4. Results for the tokens that preserved the beginning
portions of the /z/ in the word "zoo."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0    7.41    1.85   12.96   66.67    0.00    0.00    0.00   11.11    0.00
  10    5.56    0.00   16.67   62.96    1.85    0.00    0.00   11.11    1.85
  20    7.41    0.00   16.67   62.96    1.85    0.00    0.00    9.26    1.85
  30    3.70    0.00   33.33   53.70    0.00    0.00    0.00    9.26    0.00
  40    3.70    0.00   53.70   33.33    0.00    0.00    0.00    7.41    1.85
  50    1.85    0.00   57.41   25.93    0.00    0.00    0.00   14.81    0.00
  60    1.85    1.85   64.81   11.11    0.00    0.00    1.85   16.67    1.85
  70    0.00    0.00   74.07    9.26    0.00    1.85    0.00   12.96    1.85
  80    0.00    0.00   64.81    3.70    0.00    1.85    0.00   25.93    3.70
  90    0.00    0.00   75.93    7.41    0.00    0.00    0.00   14.81    1.85
 100    0.00    0.00   61.11    3.70    0.00    0.00    0.00   20.37   14.81
 110    0.00    0.00   70.37    5.56    0.00    1.85    0.00   14.81    7.41
 120    1.85    0.00   70.37    1.85    0.00    1.85    0.00   18.52    5.56
 130    0.00    0.00   61.11    0.00    0.00    3.70    0.00   22.22   12.96
 140    0.00    0.00   70.37    0.00    0.00    0.00    0.00   12.96   16.67
 150    0.00    0.00   66.67    0.00    1.85    0.00    0.00   12.96   18.52
 160    0.00    0.00   62.96    0.00    0.00    1.85    0.00    5.56   29.63
 170    0.00    3.70   25.93    0.00    0.00    0.00    0.00    9.26   61.11
 180    0.00    0.00   38.89    0.00    0.00    0.00    1.85    5.56   53.70
 190    0.00    0.00   38.89    0.00    0.00    0.00    0.00    7.41   53.70
 200    0.00    0.00   29.63    0.00    0.00    1.85    0.00    3.70   64.81
 210    0.00    0.00   18.52    0.00    0.00    0.00    0.00    0.00   81.48
 220    0.00    0.00   20.37    0.00    0.00    0.00    0.00    0.00   79.63
 230    0.00    0.00   24.07    0.00    0.00    0.00    0.00    3.70   72.22
 240    0.00    0.00   16.67    0.00    0.00    0.00    0.00    0.00   83.33


Table C-5. Results for the tokens that preserved the middle portions
of the /z/ in the word "zoo."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0    7.41    0.00   12.96   70.37    0.00    0.00    0.00    7.41    1.85
  10    9.26    9.26   42.59   25.93    1.85    0.00    0.00   11.11    0.00
  20    1.85   22.22   37.04   20.37    0.00    0.00    1.85   16.67    0.00
  30    3.70   16.67   48.15    5.56    1.85    0.00    0.00   16.67    7.41
  40    0.00    5.56   57.41    5.56    0.00    0.00    0.00   16.67   14.81
  50    3.70    3.70   35.19    5.56    0.00    0.00    0.00   14.81   37.04
  60    0.00    0.00   55.56    0.00    0.00    0.00    0.00   16.67   27.78
  70    0.00    0.00   46.30    0.00    0.00    0.00    0.00   12.96   40.74
  80    0.00    0.00   46.30    1.85    0.00    0.00    0.00    3.70   48.15
  90    0.00    0.00   53.70    1.85    0.00    0.00    0.00   12.96   31.48
 100    0.00    0.00   57.41    1.85    0.00    0.00    0.00    7.41   33.33
 110    0.00    0.00   31.48    0.00    0.00    0.00    0.00    1.85   66.67
 120    0.00    0.00   61.11    0.00    0.00    0.00    0.00    9.26   29.63
 130    0.00    0.00   35.19    0.00    0.00    0.00    0.00    7.41   57.41
 140    0.00    0.00   37.04    1.85    0.00    1.85    0.00    3.70   55.56
 150    0.00    0.00   38.89    0.00    0.00    1.85    1.85    5.56   51.85
 160    0.00    0.00   33.33    0.00    0.00    0.00    0.00    5.56   61.11
 170    0.00    0.00   22.22    0.00    0.00    0.00    0.00    1.85   75.93
 180    0.00    0.00   40.74    0.00    0.00    0.00    0.00    1.85   57.41
 190    0.00    0.00   29.63    0.00    0.00    0.00    0.00    3.70   66.67
 200    0.00    0.00   16.67    0.00    0.00    0.00    1.85    5.56   75.93
 210    0.00    0.00    9.26    0.00    0.00    0.00    0.00    0.00   90.74
 220    0.00    0.00   24.07    0.00    0.00    0.00    0.00    5.56   70.37
 230    0.00    1.85   11.11    0.00    0.00    0.00    0.00    0.00   87.04
 240    0.00    0.00    9.26    0.00    0.00    0.00    0.00    0.00   90.74


Table C-6. Results for the tokens that preserved the end portions of
the /z/ in the word "zoo."

  ms     boo     doo     lou     ooh     poo     sue     two    thoo     zoo
   0    7.41    0.00   14.81   62.96    1.85    0.00    0.00   11.11    1.85
  10    5.56    0.00   24.07   61.11    0.00    0.00    0.00    9.26    0.00
  20   12.96    0.00   33.33   46.30    1.85    0.00    0.00    5.56    0.00
  30    9.26    1.85   33.33   40.74    0.00    0.00    0.00   14.81    0.00
  40    5.56    5.56   31.48   37.04    0.00    1.85    0.00   11.11    7.41
  50    1.85   29.63   20.37    7.41    0.00    1.85    0.00   22.22   16.67
  60    3.70   11.11   37.04    7.41    0.00    0.00    0.00   14.81   25.93
  70    1.85   14.81   33.33    3.70    0.00    0.00    0.00    5.56   40.74
  80    0.00   18.52   16.67    1.85    0.00    0.00    0.00   11.11   51.85
  90    0.00    5.56   27.78    0.00    0.00    0.00    0.00    1.85   64.81
 100    0.00    1.85   22.22    0.00    0.00    0.00    0.00    3.70   72.22
 110    0.00    1.85   25.93    0.00    0.00    0.00    0.00    5.56   66.67
 120    1.85    0.00   24.07    0.00    0.00    0.00    0.00    5.56   68.52
 130    0.00    0.00   22.22    0.00    0.00    0.00    0.00    1.85   75.93
 140    0.00    1.85   14.81    3.70    0.00    0.00    0.00    1.85   77.78
 150    0.00    0.00   24.07    0.00    0.00    0.00    0.00    3.70   72.22
 160    0.00    0.00   14.81    0.00    0.00    1.85    0.00    1.85   81.48
 170    0.00    0.00   18.52    0.00    0.00    0.00    0.00    3.70   77.78
 180    0.00    0.00   16.67    0.00    0.00    1.85    0.00    0.00   81.48
 190    0.00    0.00   18.52    0.00    0.00    0.00    0.00    1.85   79.63
 200    0.00    0.00    7.41    0.00    0.00    0.00    0.00    1.85   90.74
 210    0.00    1.85   11.11    0.00    0.00    0.00    0.00    0.00   87.04
 220    0.00    0.00   24.07    0.00    0.00    0.00    0.00    1.85   74.07
 230    0.00    0.00    9.26    0.00    0.00    0.00    1.85    0.00   88.89
 240    0.00    0.00   14.81    0.00    0.00    0.00    0.00    1.85   83.33


Table C-7. Results for the tokens that preserved the beginning
portions of the /s/ in the word "said."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0   14.81    0.00    1.85   29.63    0.00    0.00    0.00   51.85    1.85
  10    0.00    3.70    1.85   25.93    0.00    0.00    0.00   66.67    1.85
  20    0.00    1.85    0.00   24.07    0.00    1.85    1.85   66.67    3.70
  30    0.00    0.00    0.00   12.96    0.00    0.00    0.00   62.96   24.07
  40    0.00    1.85    5.56    5.56    0.00    1.85    0.00   42.59   42.59
  50    0.00    0.00    0.00    9.26    0.00   11.11    0.00   44.44   35.19
  60    0.00    0.00    0.00    7.41    1.85   42.59    0.00   18.52   29.63
  70    0.00    0.00    0.00    5.56    0.00   55.56    0.00   14.81   24.07
  80    0.00    0.00    0.00    1.85    0.00   70.37    1.85    7.41   18.52
  90    0.00    0.00    0.00    0.00    0.00   68.52    0.00   18.52   12.96
 100    0.00    0.00    0.00    0.00    0.00   83.33    0.00    1.85   14.81
 110    0.00    0.00    0.00    0.00    0.00   90.74    0.00    3.70    5.56
 120    0.00    0.00    0.00    0.00    0.00   90.74    0.00    3.70    5.56
 130    0.00    0.00    0.00    0.00    0.00   98.15    0.00    0.00    1.85
 140    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 150    0.00    0.00    0.00    0.00    0.00   98.15    0.00    1.85    0.00
 160    0.00    0.00    0.00    0.00    0.00   96.30    0.00    0.00    3.70
 170    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 180    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 190    0.00    0.00    0.00    0.00    0.00   98.15    0.00    1.85    0.00
 200    0.00    0.00    0.00    0.00    0.00   98.15    0.00    1.85    0.00


Table C-8. Results for the tokens that preserved the middle portions
of the /s/ in the word "said."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0   16.67    0.00    5.56   25.93    1.85    0.00    0.00   46.30    3.70
  10    0.00   18.52    1.85    7.41    0.00    0.00    1.85   70.37    0.00
  20    0.00   14.81    0.00   11.11    0.00    0.00    9.26   57.41    7.41
  30    0.00    1.85    1.85    7.41    0.00    1.85   12.96   57.41   16.67
  40    0.00    0.00    0.00    5.56    0.00   11.11    7.41   37.04   38.89
  50    0.00    1.85    0.00    1.85    0.00   42.59    5.56   25.93   22.22
  60    0.00    1.85    0.00    3.70    0.00   50.00    1.85   11.11   31.48
  70    0.00    0.00    0.00    0.00    0.00   59.26    3.70    9.26   27.78
  80    0.00    0.00    0.00    1.85    0.00   57.41    0.00   16.67   24.07
  90    0.00    0.00    0.00    0.00    0.00   81.48    0.00    3.70   14.81
 100    0.00    0.00    0.00    0.00    0.00   81.48    1.85    7.41    9.26
 110    0.00    0.00    0.00    0.00    0.00   87.04    0.00    7.41    5.56
 120    0.00    0.00    0.00    0.00    0.00   94.44    0.00    3.70    1.85
 130    0.00    0.00    0.00    0.00    0.00   90.74    0.00    3.70    5.56
 140    0.00    0.00    0.00    0.00    0.00   94.44    0.00    3.70    1.85
 150    0.00    0.00    0.00    0.00    0.00   92.59    0.00    5.56    1.85
 160    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
 170    0.00    0.00    0.00    0.00    0.00   98.15    0.00    0.00    1.85
 180    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 190    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 200    0.00    0.00    0.00    0.00    0.00   98.15    0.00    0.00    1.85


Table C-9. Results for the tokens that preserved the end portions of
the /s/ in the word "said."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0   14.81    0.00    0.00   40.74    0.00    0.00    0.00   44.44    0.00
  10    7.41    3.70    1.85   27.78    0.00    0.00    1.85   57.41    0.00
  20    0.00   11.11    1.85   11.11    0.00    0.00    1.85   74.07    0.00
  30    1.85   20.37    0.00    3.70    0.00    0.00   20.37   48.15    5.56
  40    0.00    9.26    0.00    1.85    0.00    1.85   50.00   33.33    3.70
  50    0.00    7.41    0.00    1.85    0.00    7.41   44.44   18.52   20.37
  60    0.00    0.00    0.00    3.70    0.00   16.67   29.63   25.93   24.07
  70    0.00    1.85    0.00    0.00    0.00   27.78   25.93   18.52   25.93
  80    0.00    0.00    0.00    0.00    0.00   50.00    9.26   14.81   25.93
  90    0.00    0.00    0.00    0.00    0.00   50.00   12.96   22.22   14.81
 100    0.00    0.00    0.00    0.00    0.00   75.93    1.85    9.26   12.96
 110    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 120    0.00    0.00    0.00    0.00    0.00   87.04    3.70    9.26    0.00
 130    0.00    0.00    0.00    0.00    0.00   88.89    0.00    7.41    3.70
 140    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 150    0.00    0.00    0.00    0.00    0.00   90.74    1.85    5.56    1.85
 160    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 170    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 180    0.00    0.00    0.00    0.00    0.00   96.30    0.00    3.70    0.00
 190    0.00    0.00    0.00    0.00    0.00   98.15    0.00    1.85    0.00
 200    0.00    0.00    0.00    0.00    0.00   92.59    0.00    5.56    1.85


Table C-10. Results for the tokens that preserved the beginning
portions of the /z/ in the word "zed."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0    0.00   14.81    9.26   24.07    0.00    0.00    0.00   42.59    9.26
  10    0.00    9.26   18.52   14.81    0.00    0.00    0.00   46.30   11.11
  20    0.00    7.41   11.11   12.96    0.00    0.00    0.00   61.11    7.41
  30    0.00    9.26   20.37   22.22    0.00    0.00    0.00   40.74    7.41
  40    0.00    3.70    9.26   11.11    0.00    0.00    0.00   62.96   12.96
  50    0.00    3.70    9.26    1.85    0.00    0.00    0.00   50.00   35.19
  60    0.00    1.85    9.26    3.70    0.00    1.85    1.85   24.07   57.41
  70    0.00    3.70    9.26    1.85    0.00    0.00    0.00   37.04   48.15
  80    0.00    1.85   11.11    0.00    0.00    0.00    0.00   29.63   57.41
  90    0.00    1.85    5.56    0.00    0.00    0.00    0.00   20.37   72.22
 100    0.00    0.00    5.56    0.00    0.00    0.00    0.00   11.11   83.33
 110    0.00    1.85    7.41    0.00    0.00    0.00    0.00   14.81   75.93
 120    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
 130    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
 140    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 150    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 160    0.00    1.85    0.00    0.00    0.00    0.00    0.00    0.00   98.15
 170    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
 180    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00


Table C-11. Results for the tokens that preserved the middle portions
of the /z/ in the word "zed."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0    0.00    9.26   14.81   25.93    0.00    0.00    0.00   38.89   11.11
  10    0.00   79.63    1.85    1.85    0.00    0.00    1.85    7.41    7.41
  20    0.00   62.96    1.85    0.00    0.00    0.00    0.00   14.81   20.37
  30    1.85   79.63    0.00    1.85    0.00    0.00    0.00    9.26    7.41
  40    0.00    3.70    0.00    0.00    0.00    0.00    0.00    3.70   92.59
  50    0.00    0.00    1.85    0.00    0.00    0.00    0.00    5.56   92.59
  60    0.00    1.85    1.85    0.00    0.00    0.00    0.00    5.56   90.74
  70    0.00    0.00    0.00    0.00    0.00    1.85    0.00    1.85   96.30
  80    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
  90    0.00    0.00    0.00    1.85    0.00    0.00    0.00    1.85   96.30
 100    0.00    0.00    0.00    0.00    0.00    1.85    0.00    1.85   96.30
 110    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
 120    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.85   98.15
 130    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 140    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 150    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 160    0.00    1.85    0.00    0.00    0.00    0.00    0.00    1.85   96.30
 170    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 180    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00


Table C-12. Results for the tokens that preserved the end portions of
the /z/ in the word "zed."

  ms     bed    dead     led      ed     ped    said     ted    thed     zed
   0    0.00    9.26   11.11   24.07    0.00    0.00    0.00   44.44   11.11
  10    0.00   11.11   11.11   16.67    0.00    0.00    0.00   35.19   25.93
  20    0.00    5.56    5.56    5.56    0.00    0.00    0.00   25.93   57.41
  30    0.00    9.26    9.26   12.96    0.00    0.00    0.00   38.89   29.63
  40    0.00    7.41    1.85    0.00    0.00    0.00    0.00   22.22   68.52
  50    0.00    3.70    1.85    0.00    0.00    1.85    0.00   16.67   75.93
  60    0.00    7.41    0.00    0.00    0.00    0.00    0.00    9.26   83.33
  70    0.00    7.41    0.00    0.00    0.00    0.00    0.00    9.26   83.33
  80    0.00    3.70    0.00    0.00    0.00    0.00    0.00    7.41   88.89
  90    0.00    3.70    0.00    0.00    0.00    0.00    0.00    1.85   94.44
 100    0.00    3.70    0.00    0.00    1.85    1.85    0.00    7.41   85.19
 110    0.00    0.00    0.00    0.00    0.00    0.00    0.00    3.70   96.30
 120    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 130    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 140    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 150    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 160    0.00    1.85    0.00    0.00    0.00    0.00    0.00    1.85   96.30
 170    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 180    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00